Data Engineering Podcast

Tobias Macey
Jul 13, 2021 • 49min

Exploring The Design And Benefits Of The Modern Data Stack

Summary
We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and "best practices" to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the "Modern Data Stack". In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and discuss the benefits of bringing engineers and business users together with data.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan about their experiences with managed services in the modern data stack in their work as consultants at GoDataDriven.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving your definition of the modern data stack?
What are the key characteristics of a tool or platform that make it a candidate for the "modern" stack?
How does the modern data stack shift the responsibilities and capabilities of data professionals and consumers?
What are some difficulties that you face when working with customers to migrate to these new architectures?
What are some of the limitations of the components or paradigms of the modern stack?
What are some strategies that you have devised for addressing those limitations?
What are some edge cases that you have run up against with specific vendors that you have had to work around?
What are the "gotchas" that you don’t run up against until you’ve deployed a service and started using it at scale and over time?
How does data governance get applied across the various services and systems of the modern stack?
One of the core promises of cloud-based and managed services for data is the ability for data analysts and consumers to self-serve. What kinds of training have you found to be necessary/useful for those end-users?
What is the role of data engineers in the context of the "modern" stack?
What are the most interesting, innovative, or unexpected manifestations of the modern data stack that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to implement a modern data stack?
When is the modern data stack the wrong choice?
What new architectures or tools are you keeping an eye on for future client work?

Contact Info
Guillermo: LinkedIn, guillesd on GitHub
Bram: LinkedIn, bramochsendorf on GitHub
Juan: LinkedIn, jmperafan on GitHub

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
GoDataDriven Deloitte RPA == Robotic Process Automation Analytics Engineer James Webb Space Telescope Fivetran Podcast Episode dbt Podcast Episode Data Governance Podcast Episodes Azure Cloud Platform Stitch Data Airflow Prefect Argo Project Looker Azure Purview Soda Data Podcast Episode Datafold Materialize Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast
Jul 9, 2021 • 1h 7min

Democratize Data Cleaning Across Your Organization With Trifacta

Summary
Every data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a platform for managing your data engineering workflow to make curating, cleaning, and preparing your information more approachable for everyone in the business. In this episode CEO Adam Wilson shares the story behind the business, discusses the myriad ways that data wrangling is performed across the business, and explains how the platform is architected to adapt to the ever-changing landscape of data management tools. This is a great conversation about how deliberate user experience and platform design can make a drastic difference in the amount of value that a business can provide to their customers.

Announcements
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.
Your host is Tobias Macey and today I’m interviewing Adam Wilson about Trifacta, a platform for modern data workers to assess quality, transform, and automate data pipelines.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Trifacta is and the story behind it?
Across your site and material you focus on using the term "data wrangling". What is your personal definition of that term, and in what ways do you differentiate from ETL/ELT?
How does the deliberate use of that terminology influence the way that you think about the design and features of the Trifacta platform?
What is Trifacta’s role in the overall data platform/data lifecycle for an organization?
What are some examples of tools that Trifacta might replace?
What tools or systems does Trifacta integrate with?
Who are the target end-users of the Trifacta platform and how do those personas direct the design and functionality?
Can you describe how Trifacta is architected?
How have the goals and design of the system changed or evolved since you first began working on it?
Can you talk through the workflow and lifecycle of data as it traverses your platform, and the user interactions that drive it?
How can data engineers share and encourage proper patterns for working with data assets with end-users across the organization?
What are the limits of scale for volume and complexity of data assets that users are able to manage through Trifacta’s visual tools?
What are some strategies that you and your customers have found useful for pre-processing the information that enters your platform to increase the accessibility for end-users to self-serve?
What are the most interesting, innovative, or unexpected ways that you have seen Trifacta used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Trifacta?
When is Trifacta the wrong choice?
What do you have planned for the future of Trifacta?

Contact Info
LinkedIn
@a_adam_wilson on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Trifacta Informatica UC Berkeley Stanford University Citadel Podcast Episode Stanford Data Wrangler DBT Podcast Episode Pig Databricks Sqoop Flume SPSS Tableau SDLC == Software Delivery Life-Cycle
Jul 5, 2021 • 56min

Stick All Of Your Systems And Data Together With SaaSGlue As Your Workflow Manager

Summary
At the core of every data pipeline is a workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy, so it is important to pick something that provides the power and flexibility that you need. SaaSGlue is a managed service that lets you connect all of your systems, across clouds and physical infrastructure, and spanning all of your programming languages. In this episode Bart and Rich Wood explain how SaaSGlue is architected to allow for a high degree of flexibility in usage and deployment, their experience building a business with family, and how you can get started using it today. This is a fascinating platform with an endless set of use cases and a great team of people behind it.

Announcements
Your host is Tobias Macey and today I’m interviewing Rich and Bart Wood about SaaSGlue, a SaaS-based integration, orchestration and automation platform that lets you fill the gaps in your existing automation infrastructure.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SaaSGlue is and the story behind it?
I understand that you are building this company with your 3 brothers. What have been the pros and cons of working with your family on this project?
What are the main use cases that you are focused on enabling?
Who are your target users and how has that influenced the features and design of the platform?
Orchestration, automation, and workflow management are all areas that have a range of active products and projects. How do you characterize SaaSGlue’s position in the overall ecosystem?
What are some of the ways that you see it integrated into a data platform?
What are the core elements and concepts of the SaaSGlue platform?
How is the SaaSGlue platform architected?
How have the goals and design of the platform changed or evolved since you first began working on it?
What are some of the assumptions that you had at the beginning of the project which have been challenged or changed as you worked through building it?
Can you talk through the workflow of someone building a task graph with SaaSGlue?
How do you handle dependency management for custom code in the payloads for agent tasks?
How does SaaSGlue manage metadata propagation throughout the execution graph?
How do you handle the myriad failure modes that you are likely to encounter? (e.g. agent failure, network partitions, individual task failures, etc.)
What are some of the tools/platforms/architectural paradigms that you looked to for inspiration while designing and building SaaSGlue?
What are the most interesting, innovative, or unexpected ways that you have seen SaaSGlue used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SaaSGlue?
When is SaaSGlue the wrong choice?
What do you have planned for the future of SaaSGlue?

Contact Info
Rich: LinkedIn
Bart: LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
SaaSGlue Jenkins Cron Airflow Ansible Terraform DSL == Domain Specific Language Clojure Gradle Polymorphism Dagster Podcast Episode Podcast.__init__ Episode Martin Kleppmann
Jul 3, 2021 • 1h 5min

Leveling Up Open Source Data Integration With Meltano Hub And The Singer SDK

Summary
Data integration in the form of extract and load is the critical first step of every data project. There are a large number of commercial and open source projects that offer that capability but it is still far from being a solved problem. One of the most promising community efforts is that of the Singer ecosystem, but it has been plagued by inconsistent quality and design of plugins. In this episode the members of the Meltano project share the work they are doing to improve the discovery, quality, and capabilities of Singer taps and targets. They explain their work on the Meltano Hub and the Singer SDK and their long term goals for the Singer community.

Announcements
Your host is Tobias Macey and today I’m interviewing Douwe Maan, Taylor Murphy, and AJ Steers about their work to level up the Singer ecosystem through projects like Meltano Hub and the Singer SDK.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the Singer ecosystem is?
What are the current weak points/challenges in the ecosystem?
What is the current role of the Meltano project/community within the ecosystem?
What are the projects and activities related to Singer that you are focused on?
What are the main goals of the Meltano Hub?
What criteria are you using to determine which projects to include in the hub?
Why is the number of targets so small?
What additional functionality do you have planned for the hub?
What functionality does the SDK provide?
How does the presence of the SDK make it easier to write taps/targets?
What do you believe the long-term impacts of the SDK on the overall availability and quality of plugins will be?
Now that you have spun out your own business and raised funding, how does that influence the priorities and focus of your work?
How do you hope to productize what you have built at Meltano?
What are the most interesting, innovative, or unexpected ways that you have seen Meltano and Singer plugins used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with the Singer community and the Meltano project?
When is Singer/Meltano the wrong choice?
What do you have planned for the future of Meltano, Meltano Hub, and the Singer SDK?

Contact Info
Douwe: Website
Taylor: LinkedIn, @tayloramurphy on Twitter, Blog
AJ: LinkedIn, @aaronsteers on Twitter, aaronsteers on GitLab

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Singer Meltano Podcast Episode Meltano Hub Singer SDK Concert Genetics GitLab Snowflake dbt Podcast Episode Microsoft SQL Server Airflow Podcast Episode Dagster Podcast Episode Podcast.__init__ Episode Prefect Podcast Episode AWS Athena Reverse ETL REST (REpresentational State Transfer) GraphQL Meltano Interpretation of Singer Specification Vision for the Future of Meltano blog post Coalesce Conference Running Your Data Team Like A Product Team
Jun 29, 2021 • 1h 6min

A Candid Exploration Of Timeseries Data Analysis With InfluxDB

Summary
While the overall concept of timeseries data is uniform, its usage and applications are far from it. One of the most demanding applications of timeseries data is for application and server monitoring due to the problem of high cardinality. In his quest to build a generalized platform for managing timeseries Paul Dix keeps getting pulled back into the monitoring arena. In this episode he shares the history of the InfluxDB project, the business that he has helped to build around it, and the architectural aspects of the engine that allow for its flexibility in managing various forms of timeseries data. This is a fascinating exploration of the technical and organizational evolution of the Influx Data platform, with some promising glimpses of where they are headed in the near future.

Announcements
Your host is Tobias Macey and today I’m interviewing Paul Dix about Influx Data and the different facets of the market for timeseries databases.

Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Influx Data and the story behind it?
Timeseries data is a fairly broad category with many variations in terms of storage volume, frequency, processing requirements, etc. This has led to an explosion of database engines and related tools to address these different needs. How do you think about your position and role in the ecosystem?
Who are your target customers and how does that focus inform your product and feature priorities?
What are the use cases that Influx is best suited for?
Can you give an overview of the different projects, tools, and services that comprise your platform?
How is InfluxDB architected?
How have the design and implementation of the DB engine changed or evolved since you first began working on it?
What are you optimizing for on the consistency vs. availability spectrum of CAP?
What is your approach to clustering/data distribution beyond a single node?
For the interface to your database engine you developed a custom query language. What was your process for deciding what syntax to use and how to structure the programmatic interface?
How do you handle the lifecycle of data in an Influx deployment? (e.g. aging out old data, periodic compaction/rollups, etc.)
With your strong focus on monitoring use cases, how do you handle the challenge of high cardinality in the data being stored?
What are some of the data modeling considerations that users should be aware of as they are designing a deployment of Influx?
What is the role of open source in your product strategy?
What are the most interesting, innovative, or unexpected ways that you have seen the Influx platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Influx?
When is InfluxDB and/or the associated tools the wrong choice?
What do you have planned for the future of Influx Data?

Contact Info
LinkedIn
pauldix on GitHub
@pauldix on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Influx Data Influx DB Search and Information Retrieval Datadog Podcast Episode New Relic StackDriver Scala Cassandra Redis KDB Latent Semantic Indexing TICK Stack ELK Stack Prometheus TSM storage engine TSI Storage Engine Golang Rust Language RAFT Protocol Telegraf Kafka InfluxQL Flux Language DataFusion Apache Arrow Apache Parquet
Jun 26, 2021 • 1h 11min

Lessons Learned From The Pipeline Data Engineering Academy

Summary
Data Engineering is a broad and constantly evolving topic, which makes it difficult to teach in a concise and effective manner. Despite that, Daniel Molnar and Peter Fabian started the Pipeline Academy to do exactly that. In this episode they reflect on the lessons that they learned while teaching the first cohort of their bootcamp how to be effective data engineers. By focusing on the fundamentals, and making everyone write code, they were able to build confidence and impart the importance of context for their students.

Announcements
Your host is Tobias Macey and today I’m interviewing Daniel Molnar and Peter Fabian about the lessons that they learned from their first cohort at the Pipeline data engineering academy.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing the curriculum and learning goals for the students?
How did you set a common baseline for all of the students to build from throughout the program?
What was your process for determining the structure of the tasks and the tooling used?
What were some of the topics/tools that the students had the most difficulty with?
What topics/tools were the easiest to grasp?
What are some difficulties that you encountered while trying to teach different concepts?
How did you deal with the tension of teaching the fundamentals while tying them to toolchains that hiring managers are looking for?
What are the successes that you had with this cohort and what changes are you making to your approach/curriculum to build on them?
What are some of the failures that you encountered and what lessons have you taken from them?
How did the pandemic impact your overall plan and execution of the initial cohort?
What were the skills that you focused on for interview preparation?
What level of ongoing support/engagement do you have with students once they complete the curriculum?
What are the most interesting, innovative, or unexpected solutions that you saw from your students?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with your first cohort?
When is a bootcamp the wrong approach for skill development?
What do you have planned for the future of the Pipeline Academy?

Contact Info
Daniel: LinkedIn, Website, @soobrosa on Twitter
Peter: LinkedIn

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Pipeline Academy Blog Scikit Pandas Urchin Kafka Three "C"s – Context, Confidence, and Code Prefect Podcast Episode Great Expectations Podcast Episode Podcast.__init__ Episode Docker Kubernetes Become a Data Engineer On A Shoestring James Mickens
Jun 23, 2021 • 58min

Make Database Performance Optimization A Playful Experience With OtterTune

Summary The database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created OtterTune to find the optimal set of parameters to use for your specific workload. In this episode he explains how the system works, the challenge of scaling it to work across different database engines, and his hopes for the future of database systems. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. 
With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Andy Pavlo about OtterTune, a system to continuously monitor and improve database performance via machine learning Interview Introduction How did you get involved in the area of data management? Can you describe what OtterTune is and the story behind it? How does it relate to your work with NoisePage? What are the challenges that database administrators, operators, and users run into when working with, configuring, and tuning transactional systems? What are some of the contributing factors to the sprawling complexity of the configurable parameters for these databases? Can you describe how OtterTune is implemented? What are some of the aggregate benefits that OtterTune can gain by running as a centralized service and learning from all of the systems that it connects to? What are some of the assumptions that you made when starting the commercialization of this technology that have been challenged or invalidated as you began working with initial customers? How have the design and goals of the system changed or evolved since you first began working on it? 
What is involved in adding support for a new database engine? How applicable are the OtterTune capabilities to analytical database engines? How do you handle tuning for variable or evolving workloads? What are some of the most interesting or esoteric configuration options that you have come across while working on OtterTune? What are some that made you facepalm? What are the most interesting, innovative, or unexpected ways that you have seen OtterTune used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on OtterTune? When is OtterTune the wrong choice? What do you have planned for the future of OtterTune? Contact Info CMU Page apavlo on GitHub @andy_pavlo on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links OtterTune CMU (Carnegie Mellon University) Brown University Michael Stonebraker H-Store Learned Indexes NoisePage Oracle DB PostgreSQL Podcast Episode MySQL RDS Gaussian Process Model Reinforcement Learning AWS Aurora MVCC (Multi-Version Concurrency Control) Puppet VectorWise GreenPlum Snowflake Podcast Episode PGTune MySQL Tuner SIGMOD The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Jun 18, 2021 • 41min

Bring Order To The Chaos Of Your Unstructured Data Assets With Unstruk

Summary Working with unstructured data has typically been a motivation for a data lake. The challenge is imposing enough order on the platform to make it useful. Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. In this episode he shares the goals of the Unstruk Data Warehouse, how it is architected to extract asset metadata and build a searchable knowledge graph from the information, and the myriad ways that the system can be used. If you are wondering how to deal with all of the information that doesn’t fit in your databases or data warehouses, then this episode is for you. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. 
Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription. Your host is Tobias Macey and today I’m interviewing Kirk Marple about Unstruk Data, a company that is building a data warehouse for unstructured data that offers automated data preparation via metadata enrichment, integrated compute, and graph-based search. Interview Introduction How did you get involved in the area of data management? Can you describe what Unstruk Data is and the story behind it? What would you classify as "unstructured data"? What are some examples of industries that rely on large or varied sets of unstructured data? What are the challenges for analytics that are posed by the different categories of unstructured data? What is the current state of the industry for working with unstructured data? What are the unique capabilities that Unstruk provides and how does it integrate with the rest of the ecosystem? Where does it sit in the overall landscape of data tools? Can you describe how the Unstruk data warehouse is implemented? What are the assumptions that you had at the start of this project that have been challenged as you started working through the technical implementation and customer trials? How has the design and architecture evolved or changed since you began working on it? 
How do you handle versioning of data, given the potential for individual files to be quite large? What are some of the considerations that users should have in mind when modeling their data in the warehouse? Can you talk through the workflow of ingesting and analyzing data with Unstruk? How do you manage data enrichment/integration with structured data sources? What are the most interesting, innovative, or unexpected ways that you have seen the technology of Unstruk used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with the Unstruk platform? When is Unstruk the wrong choice? What do you have planned for the future of Unstruk? Contact Info LinkedIn @KirkMarple on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Unstruk Data TIFF ROSBag HDF5 Media/Digital Asset Management Data Mesh SAN NAS Knowledge Graph Entity Extraction OCR (Optical Character Recognition) Cloud Native Cosmos DB Azure Functions Azure EventHub Azure Cognitive Search GraphQL KNative Schema.org Pinecone Vector Database Podcast Episode Dublin Core Metadata Initiative Knowledge Management The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Jun 15, 2021 • 1h 6min

Accelerating ML Training And Delivery With In-Database Machine Learning

Summary When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML then this conversation is for you. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today. 
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial. Your host is Tobias Macey and today I’m interviewing Paige Roberts about machine learning workflows inside the database Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of the current state of the market for databases that support in-process machine learning? What are the motivating factors for running a machine learning workflow inside the database? What styles of ML are feasible to do inside the database? (e.g. bayesian inference, deep learning, etc.) What are the performance implications of running a model training pipeline within the database runtime? (both in terms of training performance boosts, and database performance impacts) Can you describe the architecture of how the machine learning process is managed by the database engine? How do you manage interacting with Python/R/Jupyter/etc. when working within the database? What is the impact on data pipeline and MLOps architectures when using the database to manage the machine learning workflow? What are the most interesting, innovative, or unexpected ways that you have seen in-database ML used? What are the most interesting, unexpected, or challenging lessons that you have learned while working on machine learning inside the database? When is in-database ML the wrong choice? 
What are the recent trends/changes in machine learning for the database that you are excited for? Contact Info LinkedIn Blog @RobertsPaige on Twitter @PaigeEwing on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Closing Announcements Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story. To help other people find the show please leave a review on iTunes and tell your friends and co-workers. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Links Vertica SyncSort Hortonworks Infoworld – 8 databases supporting in-database machine learning Power BI Podcast Episode Grafana Tableau K-Means Clustering MPP == Massively Parallel Processing AutoML Random Forest PMML == Predictive Model Markup Language SVM == Support Vector Machine Naive Bayes XGBoost Pytorch Tensorflow Neural Magic Tensorflow Frozen Graph Parquet ORC Avro CNCF == Cloud Native Computing Foundation Hotel California VerticaPy Pandas Podcast.__init__ Episode Jupyter Notebook UDX Unifying Analytics Presentation Hadoop Yarn Holden Karau Spark Vertica Academy The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Jun 12, 2021 • 53min

Taking A Tour Of The Google Cloud Platform For Data And Analytics

Summary Google pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies that they run internally to external users of their cloud platform. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. He shares some of the common patterns for building pipelines to power business intelligence dashboards, machine learning applications, and data warehouses. If you’ve ever been overwhelmed or confused by the array of services available in the Google Cloud Platform then this episode is for you. Announcements Hello and welcome to the Data Engineering Podcast, the show about modern data management When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show! Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. 
Get started for free at dataengineeringpodcast.com/hightouch. Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription Your host is Tobias Macey and today I’m interviewing Lak Lakshmanan about the suite of services for data and analytics in Google Cloud Platform. Interview Introduction How did you get involved in the area of data management? Can you start by giving an overview of the tools and products that are offered as part of Google Cloud for data and analytics? How do the various systems relate to each other for building a full workflow? How do you balance the need for clean integration between services with the need to make them useful in isolation when used as a single component of a data platform? What have you found to be the primary motivators for customers who are adopting GCP for some or all of their data workloads? What are some of the challenges that new users of GCP encounter when working with the data and analytics products that it offers? What are the systems that you have found to be easiest to work with? Which are the most challenging to work with, whether due to the kinds of problems that they are solving for, or due to their user experience design? How has your work with customers fed back into the products that you are building on top of? What are some examples of architectural or software patterns that are unique to the GCP product suite? 
What are the most interesting, innovative, or unexpected ways that you have seen Google Cloud’s data and analytics services used? What are the most interesting, unexpected, or challenging lessons that you have learned while working at Google and helping customers succeed in their data and analytics efforts? What are some of the new capabilities, new services, or industry trends that you are most excited for? Contact Info LinkedIn @lak_gcp on Twitter Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Google Cloud Data and Analytics Services Forrester Wave Dremel BigQuery MapReduce Cloud Spanner Spanner Paper Hadoop Tensorflow Google Cloud SQL Apache Spark Dataproc Dataflow Apache Beam Databricks Mixpanel Avalanche data warehouse Kubernetes GKE (Google Kubernetes Engine) Google Cloud Run Android Youtube Google Translate Teradata Power BI Podcast Episode AI Platform Notebooks GitHub Data Repository Stack Overflow Questions Data Repository PyPI Download Statistics Recommendations AI Pub/Sub Bigtable Datastream Change Data Capture Podcast Episode About Debezium for CDC Podcast Episode About CDC with Datacoral Document AI Google Meet Data Governance Podcast Episodes The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
