

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Jul 13, 2021 • 49min
Exploring The Design And Benefits Of The Modern Data Stack
Summary
We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and "best practices" to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the "Modern Data Stack". In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and discuss the benefits of bringing engineers and business users together with data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan about their experiences with managed services in the modern data stack in their work as consultants at GoDataDriven
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving your definition of the modern data stack?
What are the key characteristics of a tool or platform that make it a candidate for the "modern" stack?
How does the modern data stack shift the responsibilities and capabilities of data professionals and consumers?
What are some difficulties that you face when working with customers to migrate to these new architectures?
What are some of the limitations of the components or paradigms of the modern stack?
What are some strategies that you have devised for addressing those limitations?
What are some edge cases that you have run up against with specific vendors that you have had to work around?
What are the "gotchas" that you don’t run up against until you’ve deployed a service and started using it at scale and over time?
How does data governance get applied across the various services and systems of the modern stack?
One of the core promises of cloud-based and managed services for data is the ability for data analysts and consumers to self-serve. What kinds of training have you found to be necessary/useful for those end-users?
What is the role of data engineers in the context of the "modern" stack?
What are the most interesting, innovative, or unexpected manifestations of the modern data stack that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to implement a modern data stack?
When is the modern data stack the wrong choice?
What new architectures or tools are you keeping an eye on for future client work?
Contact Info
Guillermo
LinkedIn
guillesd on GitHub
Bram
LinkedIn
bramochsendorf on GitHub
Juan
LinkedIn
jmperafan on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
GoDataDriven
Deloitte
RPA == Robotic Process Automation
Analytics Engineer
James Webb Space Telescope
Fivetran
Podcast Episode
dbt
Podcast Episode
Data Governance
Podcast Episodes
Azure Cloud Platform
Stitch Data
Airflow
Prefect
Argo Project
Looker
Azure Purview
Soda Data
Podcast Episode
Datafold
Materialize
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 9, 2021 • 1h 7min
Democratize Data Cleaning Across Your Organization With Trifacta
Summary
Every data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a platform for managing your data engineering workflow to make curating, cleaning, and preparing your information more approachable for everyone in the business. In this episode CEO Adam Wilson shares the story behind the business, discusses the myriad ways that data wrangling is performed across the business, and how the platform is architected to adapt to the ever-changing landscape of data management tools. This is a great conversation about how deliberate user experience and platform design can make a drastic difference in the amount of value that a business can provide to their customers.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Adam Wilson about Trifacta, a platform for modern data workers to assess quality, transform, and automate data pipelines
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Trifacta is and the story behind it?
Across your site and material you focus on using the term "data wrangling". What is your personal definition of that term, and in what ways do you differentiate from ETL/ELT?
How does the deliberate use of that terminology influence the way that you think about the design and features of the Trifacta platform?
What is Trifacta’s role in the overall data platform/data lifecycle for an organization?
What are some examples of tools that Trifacta might replace?
What tools or systems does Trifacta integrate with?
Who are the target end-users of the Trifacta platform and how do those personas direct the design and functionality?
Can you describe how Trifacta is architected?
How have the goals and design of the system changed or evolved since you first began working on it?
Can you talk through the workflow and lifecycle of data as it traverses your platform, and the user interactions that drive it?
How can data engineers share and encourage proper patterns for working with data assets with end-users across the organization?
What are the limits of scale for volume and complexity of data assets that users are able to manage through Trifacta’s visual tools?
What are some strategies that you and your customers have found useful for pre-processing the information that enters your platform to increase the accessibility for end-users to self-serve?
What are the most interesting, innovative, or unexpected ways that you have seen Trifacta used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Trifacta?
When is Trifacta the wrong choice?
What do you have planned for the future of Trifacta?
Contact Info
LinkedIn
@a_adam_wilson on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Trifacta
Informatica
UC Berkeley
Stanford University
Citadel
Podcast Episode
Stanford Data Wrangler
DBT
Podcast Episode
Pig
Databricks
Sqoop
Flume
SPSS
Tableau
SDLC == Software Development Life-Cycle
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 5, 2021 • 56min
Stick All Of Your Systems And Data Together With SaaSGlue As Your Workflow Manager
Summary
At the core of every data pipeline is a workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy, so it is important to pick something that provides the power and flexibility that you need. SaaSGlue is a managed service that lets you connect all of your systems, across clouds and physical infrastructure, spanning all of your programming languages. In this episode Bart and Rich Wood explain how SaaSGlue is architected to allow for a high degree of flexibility in usage and deployment, their experience building a business with family, and how you can get started using it today. This is a fascinating platform with an endless set of use cases and a great team of people behind it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Rich and Bart Wood about SaaSGlue, a SaaS-based integration, orchestration, and automation platform that lets you fill the gaps in your existing automation infrastructure
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SaaSGlue is and the story behind it?
I understand that you are building this company with your 3 brothers. What have been the pros and cons of working with your family on this project?
What are the main use cases that you are focused on enabling?
Who are your target users and how has that influenced the features and design of the platform?
Orchestration, automation, and workflow management are all areas that have a range of active products and projects. How do you characterize SaaSGlue’s position in the overall ecosystem?
What are some of the ways that you see it integrated into a data platform?
What are the core elements and concepts of the SaaSGlue platform?
How is the SaaSGlue platform architected?
How have the goals and design of the platform changed or evolved since you first began working on it?
What are some of the assumptions that you had at the beginning of the project which have been challenged or changed as you worked through building it?
Can you talk through the workflow of someone building a task graph with SaaSGlue?
How do you handle dependency management for custom code in the payloads for agent tasks?
How does SaaSGlue manage metadata propagation throughout the execution graph?
How do you handle the myriad failure modes that you are likely to encounter? (e.g. agent failure, network partitions, individual task failures, etc.)
What are some of the tools/platforms/architectural paradigms that you looked to for inspiration while designing and building SaaSGlue?
What are the most interesting, innovative, or unexpected ways that you have seen SaaSGlue used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SaaSGlue?
When is SaaSGlue the wrong choice?
What do you have planned for the future of SaaSGlue?
Contact Info
Rich
LinkedIn
Bart
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SaaSGlue
Jenkins
Cron
Airflow
Ansible
Terraform
DSL == Domain Specific Language
Clojure
Gradle
Polymorphism
Dagster
Podcast Episode
Podcast.__init__ Episode
Martin Kleppmann
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 3, 2021 • 1h 5min
Leveling Up Open Source Data Integration With Meltano Hub And The Singer SDK
Summary
Data integration in the form of extract and load is the critical first step of every data project. There are a large number of commercial and open source projects that offer that capability but it is still far from being a solved problem. One of the most promising community efforts is that of the Singer ecosystem, but it has been plagued by inconsistent quality and design of plugins. In this episode the members of the Meltano project share the work they are doing to improve the discovery, quality, and capabilities of Singer taps and targets. They explain their work on the Meltano Hub and the Singer SDK and their long term goals for the Singer community.
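For listeners unfamiliar with the ecosystem discussed here: the Singer specification is simple enough to sketch. A tap is any program that writes JSON-encoded SCHEMA, RECORD, and STATE messages to stdout, one per line, which a target consumes from stdin. A minimal illustrative tap in Python follows; the "users" stream and its fields are made-up examples, not part of any real tap.

```python
import json


def tap_messages():
    """Yield the message sequence a minimal Singer tap would emit.

    A real tap writes each message as one JSON line on stdout; a target
    consumes them from stdin. The "users" stream here is hypothetical.
    """
    # SCHEMA: declares the stream's shape and primary key before any records
    yield {
        "type": "SCHEMA",
        "stream": "users",
        "schema": {
            "type": "object",
            "properties": {
                "id": {"type": "integer"},
                "name": {"type": "string"},
            },
        },
        "key_properties": ["id"],
    }
    # RECORD: one message per row extracted from the source
    for row in [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]:
        yield {"type": "RECORD", "stream": "users", "record": row}
    # STATE: a bookmark the runner persists so the next sync can resume
    yield {"type": "STATE", "value": {"bookmarks": {"users": {"last_id": 2}}}}


if __name__ == "__main__":
    for message in tap_messages():
        print(json.dumps(message))
```

The Singer SDK discussed in the episode exists largely so that tap authors write only the extraction logic and inherit the message handling, stream bookkeeping, and state management shown above.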
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Douwe Maan, Taylor Murphy, and AJ Steers about their work to level up the Singer ecosystem through projects like Meltano Hub and the Singer SDK
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the Singer ecosystem is?
What are the current weak points/challenges in the ecosystem?
What is the current role of the Meltano project/community within the ecosystem?
What are the projects and activities related to Singer that you are focused on?
What are the main goals of the Meltano Hub?
What criteria are you using to determine which projects to include in the hub?
Why is the number of targets so small?
What additional functionality do you have planned for the hub?
What functionality does the SDK provide?
How does the presence of the SDK make it easier to write taps/targets?
What do you believe the long-term impacts of the SDK on the overall availability and quality of plugins will be?
Now that you have spun out your own business and raised funding, how does that influence the priorities and focus of your work?
How do you hope to productize what you have built at Meltano?
What are the most interesting, innovative, or unexpected ways that you have seen Meltano and Singer plugins used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with the Singer community and the Meltano project?
When is Singer/Meltano the wrong choice?
What do you have planned for the future of Meltano, Meltano Hub, and the Singer SDK?
Contact Info
Douwe
Website
Taylor
LinkedIn
@tayloramurphy on Twitter
Blog
AJ
LinkedIn
@aaronsteers on Twitter
aaronsteers on GitLab
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Singer
Meltano
Podcast Episode
Meltano Hub
Singer SDK
Concert Genetics
GitLab
Snowflake
dbt
Podcast Episode
Microsoft SQL Server
Airflow
Podcast Episode
Dagster
Podcast Episode
Podcast.__init__ Episode
Prefect
Podcast Episode
AWS Athena
Reverse ETL
REST (REpresentational State Transfer)
GraphQL
Meltano Interpretation of Singer Specification
Vision for the Future of Meltano blog post
Coalesce Conference
Running Your Data Team Like A Product Team
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 29, 2021 • 1h 6min
A Candid Exploration Of Timeseries Data Analysis With InfluxDB
Summary
While the overall concept of timeseries data is uniform, its usage and applications are far from it. One of the most demanding applications of timeseries data is for application and server monitoring due to the problem of high cardinality. In his quest to build a generalized platform for managing timeseries data, Paul Dix keeps getting pulled back into the monitoring arena. In this episode he shares the history of the InfluxDB project, the business that he has helped to build around it, and the architectural aspects of the engine that allow for its flexibility in managing various forms of timeseries data. This is a fascinating exploration of the technical and organizational evolution of the Influx Data platform, with some promising glimpses of where they are headed in the near future.
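For context on the cardinality problem mentioned above: InfluxDB ingests points in its line protocol, which pairs a measurement name and a set of indexed tags with unindexed fields and a timestamp. Every distinct tag combination defines a new series, which is where high cardinality comes from. The sketch below is an illustrative formatter, not Influx client code; it skips the protocol's escaping rules and the "i" suffix for integer fields.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format a single point as an InfluxDB line-protocol string.

    Layout: measurement,tag1=v1,tag2=v2 field1=v1,field2=v2 timestamp
    Tags are indexed (each distinct combination defines a series, the root
    of the high-cardinality problem); fields are not indexed. This sketch
    omits escaping of spaces/commas and the integer "i" suffix.
    """
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    field_part = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(fields.items())
    )
    return f"{measurement},{tag_part} {field_part} {timestamp_ns}"


# e.g. a CPU metric tagged by host:
# to_line_protocol("cpu", {"host": "server01"}, {"usage_user": 23.5},
#                  1434055562000000000)
```

Tagging points by something unbounded (a request ID, a container name) multiplies the series count, which is the monitoring-workload pressure the episode discusses.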
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Paul Dix about Influx Data and the different facets of the market for timeseries databases
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Influx Data and the story behind it?
Timeseries data is a fairly broad category with many variations in terms of storage volume, frequency, processing requirements, etc. This has led to an explosion of database engines and related tools to address these different needs. How do you think about your position and role in the ecosystem?
Who are your target customers and how does that focus inform your product and feature priorities?
What are the use cases that Influx is best suited for?
Can you give an overview of the different projects, tools, and services that comprise your platform?
How is InfluxDB architected?
How have the design and implementation of the DB engine changed or evolved since you first began working on it?
What are you optimizing for on the consistency vs. availability spectrum of CAP?
What is your approach to clustering/data distribution beyond a single node?
For the interface to your database engine you developed a custom query language. What was your process for deciding what syntax to use and how to structure the programmatic interface?
How do you handle the lifecycle of data in an Influx deployment? (e.g. aging out old data, periodic compaction/rollups, etc.)
With your strong focus on monitoring use cases, how do you handle the challenge of high cardinality in the data being stored?
What are some of the data modeling considerations that users should be aware of as they are designing a deployment of Influx?
What is the role of open source in your product strategy?
What are the most interesting, innovative, or unexpected ways that you have seen the Influx platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Influx?
When is Influx DB and/or the associated tools the wrong choice?
What do you have planned for the future of Influx Data?
Contact Info
LinkedIn
pauldix on GitHub
@pauldix on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Influx Data
Influx DB
Search and Information Retrieval
Datadog
Podcast Episode
New Relic
StackDriver
Scala
Cassandra
Redis
KDB
Latent Semantic Indexing
TICK Stack
ELK Stack
Prometheus
TSM storage engine
TSI Storage Engine
Golang
Rust Language
RAFT Protocol
Telegraf
Kafka
InfluxQL
Flux Language
DataFusion
Apache Arrow
Apache Parquet
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 26, 2021 • 1h 11min
Lessons Learned From The Pipeline Data Engineering Academy
Summary
Data engineering is a broad and constantly evolving discipline, which makes it difficult to teach in a concise and effective manner. Undeterred, Daniel Molnar and Peter Fabian started the Pipeline Academy to do exactly that. In this episode they reflect on the lessons that they learned while teaching the first cohort of their bootcamp how to be effective data engineers. By focusing on the fundamentals, and making everyone write code, they were able to build confidence and impart the importance of context for their students.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Daniel Molnar and Peter Fabian about the lessons that they learned from their first cohort at the Pipeline data engineering academy
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing the curriculum and learning goals for the students?
How did you set a common baseline for all of the students to build from throughout the program?
What was your process for determining the structure of the tasks and the tooling used?
What were some of the topics/tools that the students had the most difficulty with?
What topics/tools were the easiest to grasp?
What are some difficulties that you encountered while trying to teach different concepts?
How did you deal with the tension of teaching the fundamentals while tying them to toolchains that hiring managers are looking for?
What are the successes that you had with this cohort and what changes are you making to your approach/curriculum to build on them?
What are some of the failures that you encountered and what lessons have you taken from them?
How did the pandemic impact your overall plan and execution of the initial cohort?
What were the skills that you focused on for interview preparation?
What level of ongoing support/engagement do you have with students once they complete the curriculum?
What are the most interesting, innovative, or unexpected solutions that you saw from your students?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with your first cohort?
When is a bootcamp the wrong approach for skill development?
What do you have planned for the future of the Pipeline Academy?
Contact Info
Daniel
LinkedIn
Website
@soobrosa on Twitter
Peter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pipeline Academy
Blog
Scikit
Pandas
Urchin
Kafka
Three "C"s – Context, Confidence, and Code
Prefect
Podcast Episode
Great Expectations
Podcast Episode
Podcast.__init__ Episode
Docker
Kubernetes
Become a Data Engineer On A Shoestring
James Mickens
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 23, 2021 • 58min
Make Database Performance Optimization A Playful Experience With OtterTune
Summary
The database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created OtterTune to find the optimal set of parameters to use for your specific workload. In this episode he explains how the system works, the challenge of scaling it to work across different database engines, and his hopes for the future of database systems.
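The summary above frames tuning as a search over dozens of interacting parameters scored against a real workload. OtterTune itself uses Gaussian process models trained on observed runtime metrics; purely as a toy illustration of configuration search (the knob names, ranges, and cost function below are invented, not OtterTune's API or any real engine's parameters), a random-search sketch looks like:

```python
import random

# Hypothetical knob space -- names and ranges are illustrative only,
# not taken from any real database engine.
KNOBS = {
    "buffer_pool_mb": (128, 8192),
    "max_connections": (10, 500),
    "checkpoint_secs": (30, 900),
}

def synthetic_latency(cfg):
    """Stand-in for benchmarking a real workload: lower is better.
    Penalizes small buffer pools, high connection counts, and
    overly frequent checkpoints."""
    return (4096 / cfg["buffer_pool_mb"]
            + cfg["max_connections"] / 100
            + 300 / cfg["checkpoint_secs"])

def random_search(trials=200, seed=42):
    """Sample random configurations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("inf")
    for _ in range(trials):
        cfg = {k: rng.randint(lo, hi) for k, (lo, hi) in KNOBS.items()}
        score = synthetic_latency(cfg)
        if score < best_score:
            best_cfg, best_score = cfg, score
    return best_cfg, best_score

best_cfg, best_score = random_search()
print(best_cfg, round(best_score, 2))
```

The interesting part of the real system, as discussed in the episode, is replacing blind random sampling with a learned surrogate model that can generalize across workloads and across the customers it observes.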
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Andy Pavlo about OtterTune, a system to continuously monitor and improve database performance via machine learning
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what OtterTune is and the story behind it?
How does it relate to your work with NoisePage?
What are the challenges that database administrators, operators, and users run into when working with, configuring, and tuning transactional systems?
What are some of the contributing factors to the sprawling complexity of the configurable parameters for these databases?
Can you describe how OtterTune is implemented?
What are some of the aggregate benefits that OtterTune can gain by running as a centralized service and learning from all of the systems that it connects to?
What are some of the assumptions that you made when starting the commercialization of this technology that have been challenged or invalidated as you began working with initial customers?
How have the design and goals of the system changed or evolved since you first began working on it?
What is involved in adding support for a new database engine?
How applicable are the OtterTune capabilities to analytical database engines?
How do you handle tuning for variable or evolving workloads?
What are some of the most interesting or esoteric configuration options that you have come across while working on OtterTune?
What are some that made you facepalm?
What are the most interesting, innovative, or unexpected ways that you have seen OtterTune used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on OtterTune?
When is OtterTune the wrong choice?
What do you have planned for the future of OtterTune?
Contact Info
CMU Page
apavlo on GitHub
@andy_pavlo on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
OtterTune
CMU (Carnegie Mellon University)
Brown University
Michael Stonebraker
H-Store
Learned Indexes
NoisePage
Oracle DB
PostgreSQL
Podcast Episode
MySQL
RDS
Gaussian Process Model
Reinforcement Learning
AWS Aurora
MVCC (Multi-Version Concurrency Control)
Puppet
VectorWise
GreenPlum
Snowflake
Podcast Episode
PGTune
MySQL Tuner
SIGMOD
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 18, 2021 • 41min
Bring Order To The Chaos Of Your Unstructured Data Assets With Unstruk
Summary
Working with unstructured data has typically been a motivation for a data lake. The challenge is imposing enough order on the platform to make it useful. Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. In this episode he shares the goals of the Unstruk Data Warehouse, how it is architected to extract asset metadata and build a searchable knowledge graph from the information, and the myriad ways that the system can be used. If you are wondering how to deal with all of the information that doesn’t fit in your databases or data warehouses, then this episode is for you.
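The pattern described above, extracting metadata from unstructured assets and making it searchable, can be sketched in miniature. Unstruk's actual pipeline is far richer (entity extraction, OCR, a real knowledge graph); this toy version only uses filesystem facts and a flat inverted index, and every file name below is made up for illustration:

```python
import os
import pathlib
import tempfile
from collections import defaultdict

def extract_metadata(path):
    """Toy metadata extraction: a real system would parse EXIF tags,
    ROS bags, HDF5 attributes, run OCR, and so on. Here we only use
    what the filesystem gives us."""
    stat = os.stat(path)
    stem, ext = os.path.splitext(os.path.basename(path))
    return {
        "path": path,
        "size": stat.st_size,
        "type": ext.lstrip(".").lower() or "unknown",
        "tags": [t for t in stem.lower().split("_") if t],
    }

def build_index(paths):
    """Inverted index from tag -> asset paths: a (very) flattened
    stand-in for a searchable knowledge graph."""
    index = defaultdict(set)
    for p in paths:
        meta = extract_metadata(p)
        for tag in meta["tags"] + [meta["type"]]:
            index[tag].add(meta["path"])
    return index

# Demo on a temporary directory with two fake asset files.
with tempfile.TemporaryDirectory() as d:
    for name in ["drone_flight_001.tiff", "lidar_scan_002.bag"]:
        pathlib.Path(d, name).write_bytes(b"\x00")
    idx = build_index([str(p) for p in pathlib.Path(d).iterdir()])
    print(sorted(idx))
```

A query then becomes a simple lookup (`idx["drone"]` returns the matching asset paths); the episode discusses how graph-based search generalizes this well beyond exact tag matches.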
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Kirk Marple about Unstruk Data, a company that is building a data warehouse for unstructured data that offers automated data preparation via metadata enrichment, integrated compute, and graph-based search
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Unstruk Data is and the story behind it?
What would you classify as "unstructured data"?
What are some examples of industries that rely on large or varied sets of unstructured data?
What are the challenges for analytics that are posed by the different categories of unstructured data?
What is the current state of the industry for working with unstructured data?
What are the unique capabilities that Unstruk provides and how does it integrate with the rest of the ecosystem?
Where does it sit in the overall landscape of data tools?
Can you describe how the Unstruk data warehouse is implemented?
What are the assumptions that you had at the start of this project that have been challenged as you started working through the technical implementation and customer trials?
How has the design and architecture evolved or changed since you began working on it?
How do you handle versioning of data, given the potential for individual files to be quite large?
What are some of the considerations that users should have in mind when modeling their data in the warehouse?
Can you talk through the workflow of ingesting and analyzing data with Unstruk?
How do you manage data enrichment/integration with structured data sources?
What are the most interesting, innovative, or unexpected ways that you have seen the technology of Unstruk used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with the Unstruk platform?
When is Unstruk the wrong choice?
What do you have planned for the future of Unstruk?
Contact Info
LinkedIn
@KirkMarple on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Unstruk Data
TIFF
ROSBag
HDF5
Media/Digital Asset Management
Data Mesh
SAN
NAS
Knowledge Graph
Entity Extraction
OCR (Optical Character Recognition)
Cloud Native
Cosmos DB
Azure Functions
Azure EventHub
Azure Cognitive Search
GraphQL
KNative
Schema.org
Pinecone Vector Database
Podcast Episode
Dublin Core Metadata Initiative
Knowledge Management
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 15, 2021 • 1h 6min
Accelerating ML Training And Delivery With In-Database Machine Learning
Summary
When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML then this conversation is for you.
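The core idea above is moving the computation to the data instead of exporting the data to the model. Vertica exposes real in-database training and scoring; SQLite has nothing of the sort, but its user-defined functions can illustrate the shape of the idea, scoring rows inside the query engine rather than pulling them out first. This is a sketch with a hand-built toy model, not Vertica's API:

```python
import math
import sqlite3

# In-memory table of labeled points standing in for a warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (x REAL, y REAL)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(0.5, 0.0), (1.5, 0.0), (2.5, 1.0), (3.5, 1.0)])

def sigmoid_score(x, w=2.0, b=-4.0):
    """Toy logistic model with fixed, hand-picked weights."""
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

# Register the scorer as a SQL function so rows are scored where
# they live instead of being shipped to a separate process.
conn.create_function("score", 1, sigmoid_score)

rows = conn.execute(
    "SELECT x, y, score(x) AS p FROM events ORDER BY x").fetchall()
for x, y, p in rows:
    print(f"x={x} label={y} p={p:.3f}")
```

In a real in-database ML engine the training loop itself also runs next to the data and in parallel across nodes, which is where the performance benefits discussed in the episode come from.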
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Paige Roberts about machine learning workflows inside the database
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the current state of the market for databases that support in-process machine learning?
What are the motivating factors for running a machine learning workflow inside the database?
What styles of ML are feasible to do inside the database? (e.g. bayesian inference, deep learning, etc.)
What are the performance implications of running a model training pipeline within the database runtime? (both in terms of training performance boosts, and database performance impacts)
Can you describe the architecture of how the machine learning process is managed by the database engine?
How do you manage interacting with Python/R/Jupyter/etc. when working within the database?
What is the impact on data pipeline and MLOps architectures when using the database to manage the machine learning workflow?
What are the most interesting, innovative, or unexpected ways that you have seen in-database ML used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on machine learning inside the database?
When is in-database ML the wrong choice?
What are the recent trends/changes in machine learning for the database that you are excited for?
Contact Info
LinkedIn
Blog
@RobertsPaige on Twitter
@PaigeEwing on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Vertica
SyncSort
Hortonworks
Infoworld – 8 databases supporting in-database machine learning
Power BI
Podcast Episode
Grafana
Tableau
K-Means Clustering
MPP == Massively Parallel Processing
AutoML
Random Forest
PMML == Predictive Model Markup Language
SVM == Support Vector Machine
Naive Bayes
XGBoost
Pytorch
Tensorflow
Neural Magic
Tensorflow Frozen Graph
Parquet
ORC
Avro
CNCF == Cloud Native Computing Foundation
Hotel California
VerticaPy
Pandas
Podcast.__init__ Episode
Jupyter Notebook
UDX
Unifying Analytics Presentation
Hadoop
Yarn
Holden Karau
Spark
Vertica Academy
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 12, 2021 • 53min
Taking A Tour Of The Google Cloud Platform For Data And Analytics
Summary
Google pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies that they run internally to external users of their cloud platform. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. He shares some of the common patterns for building pipelines to power business intelligence dashboards, machine learning applications, and data warehouses. If you’ve ever been overwhelmed or confused by the array of services available in the Google Cloud Platform then this episode is for you.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Lak Lakshmanan about the suite of services for data and analytics in Google Cloud Platform.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the tools and products that are offered as part of Google Cloud for data and analytics?
How do the various systems relate to each other for building a full workflow?
How do you balance the need for clean integration between services with the need to make them useful in isolation when used as a single component of a data platform?
What have you found to be the primary motivators for customers who are adopting GCP for some or all of their data workloads?
What are some of the challenges that new users of GCP encounter when working with the data and analytics products that it offers?
What are the systems that you have found to be easiest to work with?
Which are the most challenging to work with, whether due to the kinds of problems that they are solving for, or due to their user experience design?
How has your work with customers fed back into the products that you are building on top of?
What are some examples of architectural or software patterns that are unique to the GCP product suite?
What are the most interesting, innovative, or unexpected ways that you have seen Google Cloud’s data and analytics services used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working at Google and helping customers succeed in their data and analytics efforts?
What are some of the new capabilities, new services, or industry trends that you are most excited for?
Contact Info
LinkedIn
@lak_gcp on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Google Cloud
Data and Analytics Services
Forrester Wave
Dremel
BigQuery
MapReduce
Cloud Spanner
Spanner Paper
Hadoop
Tensorflow
Google Cloud SQL
Apache Spark
Dataproc
Dataflow
Apache Beam
Databricks
Mixpanel
Avalanche data warehouse
Kubernetes
GKE (Google Kubernetes Engine)
Google Cloud Run
Android
Youtube
Google Translate
Teradata
Power BI
Podcast Episode
AI Platform Notebooks
GitHub Data Repository
Stack Overflow Questions Data Repository
PyPI Download Statistics
Recommendations AI
Pub/Sub
Bigtable
Datastream
Change Data Capture
Podcast Episode About Debezium for CDC
Podcast Episode About CDC with Datacoral
Document AI
Google Meet
Data Governance
Podcast Episodes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast
