Data Engineering Podcast

Tobias Macey

This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.

Episodes

Jul 28, 2021 • 1h

Building a Multi-Tenant Managed Platform For Streaming Data With Pulsar at Datastax

Summary

Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale. Datastax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and makes streaming workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data-intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud-native streaming platform built on Apache Pulsar.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what the Astra platform is and the story behind it?
How does streaming fit into your overall product vision and the needs of your customers?
What was your selection process/criteria for adopting a streaming engine to complement your existing technology investment?
What are the core use cases that you are aiming to support with Astra Streaming?
Can you describe the architecture and automation of your hosted platform for Pulsar?
What are the integration points that you have built to make it work well with Cassandra?
What are some of the additional tools that you have added to your distribution of Pulsar to simplify operation and use?
What are some of the sharp edges that you have had to sand down as you have scaled up your usage of Pulsar?
What is the process for someone to adopt and integrate with your Astra Streaming service?
How do you handle migrating existing projects, particularly if they are using Kafka currently?
One of the capabilities that you highlight on the product page for Astra Streaming is the ability to execute machine learning workflows on data in flight. What are some of the supporting systems that are necessary to power that workflow?
What are the capabilities that are built into Pulsar that simplify the operational aspects of streaming ML?
What are the ways that you are engaging with and supporting the Pulsar community?
What are the near to medium term elements of the Pulsar roadmap that you are working toward and excited to incorporate into Astra?
What are the most interesting, innovative, or unexpected ways that you have seen Astra used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Astra?
When is Astra the wrong choice?
What do you have planned for the future of Astra?

Contact Info

Prabhat: LinkedIn, @prabhatja on Twitter, prabhatja on GitHub
Jonathan: LinkedIn, @spyced on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Pulsar Podcast Episode Streamnative Episode Datastax Astra Streaming Datastax Astra DB Luna Streaming Distribution Datastax Cassandra Kesque (formerly Kafkaesque) Kafka RabbitMQ Prometheus Grafana Pulsar Heartbeat Pulsar Summit Pulsar Summit Presentation on Kafka Connectors Replicated Chaos Engineering Fallout chaos engineering tools Jepsen Podcast Episode Jack VanLightly BookKeeper TLA+ Model Change Data Capture

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Jul 23, 2021 • 1h 1min

Bringing The Metrics Layer To The Masses With Transform

Summary

Collecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform. He explains the challenges that occur when metrics are maintained across a variety of systems, the benefits of unifying them in a common access layer, and the potential that it unlocks for everyone in the business to confidently answer questions with data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Nick Handel about Transform, a platform providing a dedicated metrics layer for your data stack.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Transform is and the story behind it?
How do you define the concept of a "metric" in the context of the data platform?
What are the general strategies in the industry for creating, managing, and consuming metrics?
How has that been changing in the past couple of years?
What is driving that shift?
What are the main goals that you have for the Transform platform?
Who are the target users?
How does that focus influence your approach to the design of the platform?
How is the Transform platform architected?
What are the core capabilities that are required for a metrics service?
What are the integration points for a metrics service?
Can you talk through the workflow of defining and consuming metrics with Transform?
What are the challenges that teams face in establishing consensus or a shared understanding around a given metric definition?
What are the lifecycle stages that need to be factored into the long-term maintenance of a metric definition?
What are some of the capabilities or projects that are made possible by having a metrics layer in the data platform?
What are the capabilities in downstream tools that are currently missing or underdeveloped to support the metrics store as a core layer of the platform?
What are the most interesting, innovative, or unexpected ways that you have seen Transform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Transform?
When is Transform the wrong choice?
What do you have planned for the future of Transform?

Contact Info

LinkedIn, @nick_handel on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Transform Transform’s Metrics Framework Transform’s Metrics Catalog Transform’s Metrics API Nick’s experiences using Airbnb’s Metrics Store Get Transform BlackRock AirBnB Airflow Superset Podcast Episode AirBnB Knowledge Repo AirBnB Minerva Metric Store OLAP Cube Semantic Layer Master Data Management Podcast Episode Data Normalization OpenLineage

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Jul 20, 2021 • 1h 1min

Strategies For Proactive Data Quality Management

Summary

Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at Datafold to help provide tools for implementing them.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Datafold and the story behind it?
What are the biggest factors that you see contributing to data quality issues?
How are teams identifying and addressing those failures?
How does the data platform architecture impact the potential for introducing quality problems?
What are some of the potential risks or consequences of introducing errors in data processing?
How can organizations shift to being proactive in their data quality management?
How much of a role does tooling play in addressing the introduction and remediation of data quality problems?
Can you describe how Datafold is designed and architected to allow for proactive management of data quality?
What are some of the original goals and assumptions about how to empower teams to improve data quality that have been challenged or changed as you have worked through building Datafold?
What is the workflow for an individual or team who is using Datafold as part of their data pipeline and platform development?
What are the organizational patterns that you have found to be most conducive to proactive data quality management?
Who is responsible for identifying and addressing quality issues?
What are the most interesting, innovative, or unexpected ways that you have seen Datafold used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?
When is Datafold the wrong choice?
What do you have planned for the future of Datafold?

Contact Info

LinkedIn, @glebmm on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Datafold Autodesk Airflow Podcast.__init__ Episode Spark Looker Podcast Episode Amundsen Podcast Episode dbt Podcast Episode Dagster Podcast Episode Podcast.__init__ Episode Change Data Capture Podcast Episodes Delta Lake Podcast Episode Trino Podcast Episode Presto Parquet Podcast Episode Data Quality Meetup

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Special Guest: Gleb Mezhanskiy.

Support Data Engineering Podcast

Jul 16, 2021 • 1h 13min

Low Code And High Quality Data Engineering For The Whole Organization With Prophecy

Summary

There is a wealth of tools and systems available for processing data, but the user experience of integrating them and building workflows is still lacking. This is particularly important in large and complex organizations where domain knowledge and context are paramount and there may not be access to engineers for codifying that expertise. Raj Bains founded Prophecy to address this need by creating a UI-first platform for building and executing data engineering workflows that orchestrates Airflow and Spark. Rather than locking your business logic into a proprietary storage layer and only exposing it through a drag-and-drop editor, Prophecy synchronizes all of your jobs with source control, allowing an easy bi-directional interaction between code-first and no-code experiences. In this episode he shares his motivations for creating Prophecy, how he is leveraging the magic of compilers to translate between UI- and code-oriented representations of logic, and the organizational benefits of having a cohesive experience designed to bring business users and domain experts into the same platform as data engineers and analysts.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Raj Bains about Prophecy, a low-code data engineering platform built on Spark and Airflow.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Prophecy and the story behind it?
There are a huge number of tools and recommended architectures for every variety of data need. Why is data engineering still such a complicated and challenging undertaking?
What features and capabilities does Prophecy provide to help address those issues?
What are the roles and use cases that you are focusing on serving with Prophecy?
What are the elements of the data platform that Prophecy can replace?
Can you describe how Prophecy is implemented?
What were your selection criteria for the foundational elements of the platform?
What would be involved in adopting other execution and orchestration engines?
Can you describe the workflow of building a pipeline with Prophecy?
What are the design and structural features that you have built to manage workflows as they scale in terms of technical and organizational complexity?
What are the options for data engineers/data professionals to build and share reusable components across the organization?
What are the most interesting, innovative, or unexpected ways that you have seen Prophecy used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Prophecy?
When is Prophecy the wrong choice?
What do you have planned for the future of Prophecy?

Contact Info

LinkedIn, @_raj_bains on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

Prophecy CUDA Apache Hive Hortonworks NoSQL NewSQL Paxos Apache Impala AbInitio Teradata Snowflake Podcast Episode Presto Podcast Episode LinkedIn Spark Databricks Cron Airflow Astronomer Alteryx Streamsets Azure Data Factory Apache Flink Podcast Episode Prefect Podcast Episode Dagster Podcast Episode Podcast.__init__ Episode Kubernetes Operator Scala Kafka Abstract Syntax Tree Language Server Protocol Amazon Deequ dbt Tecton Podcast Episode Informatica

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Jul 13, 2021 • 49min

Exploring The Design And Benefits Of The Modern Data Stack

Summary

We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and "best practices" to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the "Modern Data Stack". In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and discuss the benefits of bringing engineers and business users together with data.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan about their experiences with managed services in the modern data stack in their work as consultants at GoDataDriven.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by giving your definition of the modern data stack?
What are the key characteristics of a tool or platform that make it a candidate for the "modern" stack?
How does the modern data stack shift the responsibilities and capabilities of data professionals and consumers?
What are some difficulties that you face when working with customers to migrate to these new architectures?
What are some of the limitations of the components or paradigms of the modern stack?
What are some strategies that you have devised for addressing those limitations?
What are some edge cases that you have run up against with specific vendors that you have had to work around?
What are the "gotchas" that you don’t run up against until you’ve deployed a service and started using it at scale and over time?
How does data governance get applied across the various services and systems of the modern stack?
One of the core promises of cloud-based and managed services for data is the ability for data analysts and consumers to self-serve. What kinds of training have you found to be necessary/useful for those end-users?
What is the role of data engineers in the context of the "modern" stack?
What are the most interesting, innovative, or unexpected manifestations of the modern data stack that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to implement a modern data stack?
When is the modern data stack the wrong choice?
What new architectures or tools are you keeping an eye on for future client work?

Contact Info

Guillermo: LinkedIn, guillesd on GitHub
Bram: LinkedIn, bramochsendorf on GitHub
Juan: LinkedIn, jmperafan on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

GoDataDriven Deloitte RPA == Robotic Process Automation Analytics Engineer James Webb Space Telescope Fivetran Podcast Episode dbt Podcast Episode Data Governance Podcast Episodes Azure Cloud Platform Stitch Data Airflow Prefect Argo Project Looker Azure Purview Soda Data Podcast Episode Datafold Materialize Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast

Jul 9, 2021 • 1h 7min

Democratize Data Cleaning Across Your Organization With Trifacta

Summary Every data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a platform for managing your data engineering workflow to make curating, cleaning, and preparing your information more approachable for everyone in the business. In this episode CEO Adam Wilson shares the story behind the business, discusses the myriad ways that data wrangling is performed across the business, and how the platform is architected to adapt to the ever-changing landscape of data management tools. This is a great conversation about how deliberate user experience and platform design can make a drastic difference in the amount of value that a business can provide to their customers. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy! When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform.
Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription. Your host is Tobias Macey and today I’m interviewing Adam Wilson about Trifacta, a platform for modern data workers to assess quality, transform, and automate data pipelines Interview Introduction How did you get involved in the area of data management? Can you describe what Trifacta is and the story behind it? Across your site and material you focus on using the term "data wrangling". What is your personal definition of that term, and in what ways do you differentiate from ETL/ELT?
How does the deliberate use of that terminology influence the way that you think about the design and features of the Trifacta platform? What is Trifacta’s role in the overall data platform/data lifecycle for an organization? What are some examples of tools that Trifacta might replace? What tools or systems does Trifacta integrate with? Who are the target end-users of the Trifacta platform and how do those personas direct the design and functionality? Can you describe how Trifacta is architected? How have the goals and design of the system changed or evolved since you first began working on it? Can you talk through the workflow and lifecycle of data as it traverses your platform, and the user interactions that drive it? How can data engineers share and encourage proper patterns for working with data assets with end-users across the organization? What are the limits of scale for volume and complexity of data assets that users are able to manage through Trifacta’s visual tools? What are some strategies that you and your customers have found useful for pre-processing the information that enters your platform to increase the accessibility for end-users to self-serve? What are the most interesting, innovative, or unexpected ways that you have seen Trifacta used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on Trifacta? When is Trifacta the wrong choice? What do you have planned for the future of Trifacta? Contact Info LinkedIn @a_adam_wilson on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Links Trifacta Informatica UC Berkeley Stanford University Citadel Podcast Episode Stanford Data Wrangler DBT Podcast Episode Pig Databricks Sqoop Flume SPSS Tableau SDLC == Software Delivery Life-Cycle The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Jul 5, 2021 • 56min

Stick All Of Your Systems And Data Together With SaaSGlue As Your Workflow Manager

Summary At the core of every data pipeline is a workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy so it is important to pick something that provides the power and flexibility that you need. SaaSGlue is a managed service that lets you connect all of your systems, across clouds and physical infrastructure, and spanning all of your programming languages. In this episode Bart and Rich Wood explain how SaaSGlue is architected to allow for a high degree of flexibility in usage and deployment, their experience building a business with family, and how you can get started using it today. This is a fascinating platform with an endless set of use cases and a great team of people behind it. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming.
With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Rich and Bart Wood about SaaSGlue, a SaaS-based integration, orchestration and automation platform that lets you fill the gaps in your existing automation infrastructure Interview Introduction How did you get involved in the area of data management? Can you describe what SaaSGlue is and the story behind it? I understand that you are building this company with your 3 brothers. What have been the pros and cons of working with your family on this project? What are the main use cases that you are focused on enabling? Who are your target users and how has that influenced the features and design of the platform? Orchestration, automation, and workflow management are all areas that have a range of active products and projects. How do you characterize SaaSGlue’s position in the overall ecosystem? What are some of the ways that you see it integrated into a data platform? What are the core elements and concepts of the SaaSGlue platform? How is the SaaSGlue platform architected? How have the goals and design of the platform changed or evolved since you first began working on it?
What are some of the assumptions that you had at the beginning of the project which have been challenged or changed as you worked through building it? Can you talk through the workflow of someone building a task graph with SaaSGlue? How do you handle dependency management for custom code in the payloads for agent tasks? How does SaaSGlue manage metadata propagation throughout the execution graph? How do you handle the myriad failure modes that you are likely to encounter? (e.g. agent failure, network partitions, individual task failures, etc.) What are some of the tools/platforms/architectural paradigms that you looked to for inspiration while designing and building SaaSGlue? What are the most interesting, innovative, or unexpected ways that you have seen SaaSGlue used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on SaaSGlue? When is SaaSGlue the wrong choice? What do you have planned for the future of SaaSGlue? Contact Info Rich LinkedIn Bart LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links SaaSGlue Jenkins Cron Airflow Ansible Terraform DSL == Domain Specific Language Clojure Gradle Polymorphism Dagster Podcast Episode Podcast.__init__ Episode Martin Kleppmann The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast

Jul 3, 2021 • 1h 5min
