Data Engineering Podcast

Tobias Macey
Nov 10, 2020 • 52min

Building A Cost Effective Data Catalog With Tree Schema

Summary

A data catalog is a critical piece of infrastructure for any organization that wants to build analytics products, whether internal or external. While there are a number of platforms available for building that catalog, many of them are either difficult to deploy and integrate, or expensive to use at scale. In this episode Grant Seward explains how he built Tree Schema to be an easy to use and cost effective option for organizations to build their data catalogs. He also shares the internal architecture, how he approached the design to make it accessible and easy to use, and how it autodiscovers the schemas and metadata for your source systems. (An illustrative sketch of that style of schema discovery follows the links below.)

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to go.datafold.com/dataengineeringpodcast to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Your host is Tobias Macey and today I’m interviewing Grant Seward about Tree Schema, a human friendly data catalog

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what you have built at Tree Schema? What was your motivation for creating it?
- At what stage of maturity should a team or organization consider a data catalog to be a necessary component in their data platform?
- There are a large and growing number of projects and products designed to provide a data catalog, with each of them addressing the problem in a slightly different way. What are the necessary elements for a data catalog? How does Tree Schema compare to the available options? (e.g. Amundsen, Company Wiki, Metacat, Metamapper, etc.)
- How is the Tree Schema system implemented? How has the design or direction of Tree Schema evolved since you first began working on it?
- How did you approach the schema definitions for defining entities? What was your guiding heuristic for determining how to design the interface and data models?
- How do you handle integrating with data sources?
- In addition to storing schema information you allow users to store information about the transformations being performed. How is that represented? How can users populate information about their transformations in an automated fashion?
- How do you approach evolution and versioning of schema information?
- What are the scaling limitations of Tree Schema, whether in terms of the technical or cognitive complexity that it can handle?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Tree Schema being used?
- What have you found to be the most interesting, unexpected, or challenging lessons learned in the process of building and promoting Tree Schema?
- When is Tree Schema the wrong choice?
- What do you have planned for the future of the product?

Contact Info

- Email
- LinkedIn

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

- Tree Schema
- Tree Schema – Data Lineage as Code
- Capital One
- Walmart Labs
- Data Catalog
- Data Discovery
- Amundsen
- Metacat
- Marquez
- Metamapper
- Infoworks
- Collibra
- Faust (Podcast.__init__ Episode)
- Django
- PostgreSQL
- Redis
- Celery
- Amazon ECS (Elastic Container Service)
- Django Storages
- Dagster
- Airflow
- DataHub
- Avro
- Singer
- Apache Atlas

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
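To make the auto-discovery idea concrete, here is a minimal sketch of the kind of crawl a data catalog performs when it connects to a relational source. This is illustrative only: it is not Tree Schema’s actual implementation or API, and the connection details are placeholders.

```python
# Minimal sketch of schema auto-discovery against a PostgreSQL source: the
# kind of crawl a data catalog runs when it connects to a data store. This
# is NOT Tree Schema's actual code or API; connection details are placeholders.
import psycopg2

conn = psycopg2.connect(host="localhost", dbname="appdb",
                        user="catalog_reader", password="placeholder")

DISCOVERY_QUERY = """
    SELECT table_schema, table_name, column_name, data_type
    FROM information_schema.columns
    WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
    ORDER BY table_schema, table_name, ordinal_position
"""

catalog = {}
with conn, conn.cursor() as cur:
    cur.execute(DISCOVERY_QUERY)
    for schema, table, column, dtype in cur.fetchall():
        # Group discovered columns under their parent entity (schema.table)
        catalog.setdefault(f"{schema}.{table}", []).append((column, dtype))

for entity, columns in catalog.items():
    print(entity, columns)
```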
Nov 3, 2020 • 50min

Add Version Control To Your Data Lake With LakeFS

Summary

Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. In order to address these challenges the team at Treeverse created LakeFS to introduce version control capabilities to your storage layer. In this episode Einat Orr and Oz Katz explain how they implemented branching and merging capabilities for object storage, best practices for how to use versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform. (A rough sketch of the branch-and-merge workflow follows the links below.)

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- Your host is Tobias Macey and today I’m interviewing Einat Orr and Oz Katz about their work at Treeverse on the LakeFS system for versioning your data lakes the same way you version your code.

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what LakeFS is and why you built it?
- There are a number of tools and platforms that support data virtualization and data versioning. How does LakeFS compare to the available options? (e.g. Alluxio, Denodo, Pachyderm, DVC, etc.)
- What are the primary use cases that LakeFS enables?
- For someone who wants to use LakeFS what is involved in getting it set up?
- How is LakeFS implemented? How has the design of the system changed or evolved since you began working on it? What assumptions did you have going into it which have since been invalidated or modified?
- How does the workflow for an engineer or analyst change from working directly against S3 to running against the LakeFS interface?
- How do you handle merge conflicts and resolution? What are some of the potential edge cases or foot guns that they should be aware of when there are multiple people using the same repository?
- How do you approach management of the data lifecycle or garbage collection to avoid ballooning the cost of storage for a dataset that is tracking a high number of branches with diverging commits?
- Given that S3 and GCS are eventually consistent storage layers, how do you handle snapshots/transactionality of the data you are working with?
- What are the axes for scaling an installation of LakeFS? What are the limitations in terms of size or geographic distribution of the datasets?
- What are some of the most interesting, unexpected, or innovative ways that you have seen LakeFS being used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building LakeFS?
- When is LakeFS the wrong choice?
- What do you have planned for the future of the project?

Contact Info

- Einat Orr
- Oz Katz

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

- Treeverse
- LakeFS (GitHub, Documentation)
- lakeFS Slack Channel
- SimilarWeb
- Kaggle
- DagsHub
- Alluxio
- Pachyderm
- DVC
- ML Ops (Machine Learning Operations)
- DoltHub
- Delta Lake (Podcast Episode)
- Hudi
- Iceberg Table Format (Podcast Episode)
- Kubernetes
- PostgreSQL (Podcast Episode)
- Git
- Spark
- Presto
- CockroachDB
- YugabyteDB
- Citus
- Hive Metastore
- Iceberg Table Format
- Immunai

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
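To make the branching model concrete, here is a rough sketch of the workflow through lakeFS’s S3-compatible gateway, where the repository appears as the bucket and the branch name is the first segment of the object key. The endpoint, credentials, and names are placeholders, and the sketch assumes the experimental branch has already been created (for example with the lakectl CLI); consult the lakeFS documentation for the current APIs.

```python
# Rough sketch of reading from a production branch and writing to an isolated
# branch through lakeFS's S3-compatible gateway. Endpoint, credentials, repo,
# and branch names are placeholders; the "new-pipeline" branch is assumed to
# exist already (created e.g. with lakectl).
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://lakefs.example.com",  # lakeFS gateway, not AWS
    aws_access_key_id="AKIA-PLACEHOLDER",
    aws_secret_access_key="placeholder",
)

# Read production data: bucket = repository, key = branch/path
obj = s3.get_object(Bucket="analytics-repo", Key="main/events/2020-10-01.parquet")

# Write an experimental copy to an isolated branch; main is untouched until
# the branch is committed and merged back
s3.put_object(
    Bucket="analytics-repo",
    Key="new-pipeline/events/2020-10-01.parquet",
    Body=obj["Body"].read(),
)
```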
Oct 26, 2020 • 49min

Cloud Native Data Security As Code With Cyral

Summary

One of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are going to implement data security, including access controls and auditing. Different databases and storage systems all have their own method of restricting access, and they are not all compatible with each other. In order to simplify the process of securing your data in the cloud, Manav Mital created Cyral to provide a way of enforcing security as code. In this episode he explains how the system is architected, how it can help you enforce compliance, and what is involved in getting it integrated with your existing systems. This was a good conversation about an aspect of data management that is too often left as an afterthought. (A purely illustrative sketch of a policy-as-code document follows the links below.)

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Manav Mital about the challenges involved in securing your data and the work that he is doing at Cyral to help address those problems.

Interview

- Introduction
- How did you get involved in the area of data management?
- What is Cyral and what motivated you to build a business focused on addressing data security in the cloud?
- Can you start by giving an overview of some of the common security issues that occur when working with data?
- What new security challenges are introduced by building data platforms in public cloud environments?
- What are the organizational roles that are typically responsible for managing security and access control to data sources and repositories? What are the tensions, technical or organizational, that lead to a problematic or incomplete security posture?
- What are the differences in security requirements and implementation complexity between software applications and data systems?
- What are the data systems that Cyral integrates with? How did you determine what platforms to prioritize?
- How does Cyral integrate into the toolchains used to deploy, maintain, and upgrade an organization’s data infrastructure?
- How does the Cyral platform address security and access control of data across an organization’s infrastructure?
- How are schema changes handled when using Cyral to enforce access control to PII or other attributes?
- How does Cyral help with reducing sprawl of data across unmonitored systems?
- What are some of the most interesting, unexpected, or challenging lessons that you learned while building Cyral?
- When is Cyral the wrong choice?
- What do you have planned for the future of the Cyral platform?

Contact Info

- LinkedIn
- @manavrm on Twitter

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

- Cyral
- Snowflake (Podcast Episode)
- BigQuery
- Object Storage
- MongoDB

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
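As a purely illustrative sketch of the "security as code" idea, a data-access policy can be expressed as declarative, versionable data rather than as hand-managed per-database grants. The structure below is invented for the example and is not Cyral’s actual policy format.

```python
# Invented, illustrative policy-as-code document: NOT Cyral's real format.
# The point is that access rules become reviewable, versionable data that a
# proxy layer can enforce uniformly across heterogeneous databases.
PII_POLICY = {
    "data_labels": ["EMAIL", "SSN"],  # attributes to protect across all stores
    "rules": [
        {"identity": "group:analysts", "access": "masked"},  # values hashed
        {"identity": "group:compliance", "access": "read"},
        {"identity": "*", "access": "deny"},                 # default deny
    ],
    "audit": True,  # record every access for compliance reporting
}
```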
Oct 19, 2020 • 56min

Better Data Quality Through Observability With Monte Carlo

Summary

In order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines are healthy you need a way to make them observable. In this episode Barr Moses and Lior Gavish, co-founders of Monte Carlo, share the leading causes of what they refer to as data downtime and how it manifests. They also discuss methods for gaining visibility into the flow of data through your infrastructure, how to diagnose and prevent potential problems, and what they are building at Monte Carlo to help you maintain your data’s uptime. (A minimal sketch of the freshness and volume checks discussed here follows the links below.)

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about observability for your data pipelines and how they are addressing it at Monte Carlo.

Interview

- Introduction
- How did you get involved in the area of data management?
- How did you come up with the idea to found Monte Carlo?
- What is "data downtime"?
- Can you start by giving your definition of observability in the context of data workflows?
- What are some of the contributing factors that lead to poor data quality at the different stages of the lifecycle?
- Monitoring and observability of infrastructure and software applications is a well understood problem. In what ways does observability of data applications differ from "traditional" software systems?
- What are some of the metrics or signals that we should be looking at to identify problems in our data applications?
- Why is this the year that so many companies are working to address the issue of data quality and observability?
- How are you addressing the challenge of bringing observability to data platforms at Monte Carlo? What are the areas of integration that you are targeting and how did you identify where to prioritize your efforts?
- For someone who is using Monte Carlo, how does the platform help them to identify and resolve issues in their data?
- What stage of the data lifecycle have you found to be the biggest contributor to downtime and quality issues?
- What are the most challenging systems, platforms, or tool chains to gain visibility into?
- What are some of the most interesting, innovative, or unexpected ways that you have seen teams address their observability needs?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building the business and technology of Monte Carlo?
- What are the alternatives to Monte Carlo?
- What do you have planned for the future of the platform?

Contact Info

- Visit www.montecarlodata.com to learn more about our data reliability platform, or reach out directly to barr@montecarlodata.com — happy to chat about all things data!

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

- Monte Carlo
- Monte Carlo Platform
- Observability
- Gainsight
- Barracuda Networks
- DevOps
- New Relic
- Datadog
- Netflix RAD Outlier Detection

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
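To ground the idea of data downtime, here is a minimal sketch of two of the signals discussed in the episode, freshness and volume, checked against a warehouse table. The connection string, table, and thresholds are hypothetical; a platform like Monte Carlo learns expected ranges from history rather than hard-coding them.

```python
# Minimal sketch of freshness and volume checks, two signals of "data
# downtime". Connection, table, and thresholds are hypothetical; loaded_at
# is assumed to be a timezone-aware timestamp column.
from datetime import datetime, timedelta, timezone
import psycopg2

conn = psycopg2.connect("dbname=warehouse user=observer")
with conn, conn.cursor() as cur:
    cur.execute("SELECT max(loaded_at), count(*) FROM analytics.orders")
    last_loaded, row_count = cur.fetchone()

# Freshness: has new data landed within the expected window?
if datetime.now(timezone.utc) - last_loaded > timedelta(hours=6):
    print("ALERT: analytics.orders is stale")

# Volume: does the row count fall inside its usual range?
EXPECTED_MIN, EXPECTED_MAX = 90_000, 150_000
if not EXPECTED_MIN <= row_count <= EXPECTED_MAX:
    print(f"ALERT: unexpected row count {row_count}")
```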
Oct 12, 2020 • 1h 3min

Rapid Delivery Of Business Intelligence Using Power BI

Summary

Business intelligence efforts are only as useful as the outcomes that they inform. Power BI aims to reduce the time and effort required to go from information to action by providing an interface that encourages rapid iteration. In this episode Rob Collie shares his enthusiasm for the Power BI platform and how it stands out from other options. He explains how he helped to build the platform during his time at Microsoft, and how he continues to support users through his work at Power Pivot Pro. Rob shares some useful insights gained through his consulting work, and why he considers Power BI to be the best option on the market today for business analytics.

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Equalum’s end to end data ingestion platform is relied upon by enterprises across industries to seamlessly stream data to operational, real-time analytics and machine learning environments. Equalum combines streaming Change Data Capture, replication, complex transformations, batch processing and full data management using a no-code UI. Equalum also leverages open source data frameworks by orchestrating Apache Spark, Kafka and others under the hood. Tool consolidation and linear scalability without the legacy platform price tag. Go to dataengineeringpodcast.com/equalum today to start a free 2 week test run of their platform, and don’t forget to tell them that we sent you.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Rob Collie about Microsoft’s Power BI platform and his work at Power Pivot Pro to help users employ it effectively.

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you start by giving an overview of what Power BI is?
- The business intelligence market is fairly crowded. What are the features of Power BI that make it stand out?
- Who are the target users of Power BI? How does the design of the platform reflect those priorities?
- Can you talk through the workflow for someone to build a report or dashboard in Power BI?
- What is the broader ecosystem of data tools and platforms that Power BI sits within?
- What are the available integration and extension points for Power BI?
- In addition to your work at Microsoft building Power BI you now run a consulting company dedicated to helping people adopt that platform. What are some of the common challenges that users face in employing Power BI effectively?
- In your experience working with clients, what are some of the core principles of data processing and visualization that apply across industries? What are some of the modeling or presentation methods that are specific to a given industry?
- One of the perennial challenges of business intelligence is to make reports discoverable. What facilities does Power BI have to aid in surfacing useful information to end users?
- What capabilities does Power BI have for exposing elements of data quality?
- What are some of the most challenging aspects of building and maintaining a business intelligence effort in an organization?
- What are some of the most interesting, unexpected, or innovative uses of Power BI that you have seen, or projects that you have worked on?
- What are some of the most interesting, unexpected, or challenging lessons that you have learned in your work building Power BI and building a business to support its users?
- When is Power BI the wrong choice?
- What trends in business intelligence are you most excited by?

Contact Info

- LinkedIn
- @robocolli3 on Twitter

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

- P3
- Power BI
- Microsoft Excel
- Fantasy Football
- Excel Functions
- Lisp
- Business Intelligence
- VLOOKUP
- Looker (Podcast Episode)
- SQL Server Reporting Services
- SQL Server Analysis Services
- Tableau
- Master Data Management
- ERP == Enterprise Resource Planning
- M Language
- Power Query
- DAX

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
Oct 5, 2020 • 1h 1min

Self Service Real Time Data Integration Without The Headaches With Meroxa

Summary

Analytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading. In this episode founders DeVaris Brown and Ali Hamidi explain how their tenure at Heroku informed their approach to making data integration self service, how the platform is architected, and how they have designed their system to adapt to the continued evolution of the data ecosystem. (A short sketch of consuming change data capture events follows the links below.)

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing DeVaris Brown and Ali Hamidi about Meroxa, a new platform as a service for data integration

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what you are building at Meroxa and what motivated you to turn it into a business?
- What are the lessons that you learned from your time at Heroku which you are applying to your work on Meroxa?
- Who are your target users and what are your guiding principles for designing the platform interface?
- What are the common difficulties that engineers face in building and maintaining data infrastructure?
- There are a variety of platforms that offer solutions for managing data integration, or powering end-to-end analytics, or building machine learning pipelines. What are the shortcomings of those existing options that might lead someone to choose Meroxa?
- How is the Meroxa platform architected? What are some of the initial assumptions that you had which have been challenged as you proceed with implementation?
- What new capabilities does Meroxa bring to someone who uses it for integrating their application data?
- What are the growth options for organizations that get started with Meroxa?
- What are the core principles that you are focused on to allow for evolving your platform over the long run as the surrounding ecosystem continues to mature?
- When is Meroxa the wrong choice?
- What do you have planned for the future?

Contact Info

- DeVaris Brown
- Ali Hamidi

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

- Meroxa
- Heroku
- Heroku Kafka
- Ascend
- StreamSets
- Nexus
- Kafka Connect
- Airflow (Podcast.__init__ Episode)
- Spark (Data Engineering Episode)
- Change Data Capture
- Segment (Podcast Episode)
- Rudderstack
- MParticle
- Debezium (Podcast Episode)
- DBT (Podcast Episode)
- Materialize (Podcast Episode)
- Stitch Data
- Fivetran (Podcast Episode)
- Elasticsearch (Podcast Episode)
- gRPC
- GraphQL
- REST == REpresentational State Transfer
- Dagster/Elementl (Data Engineering Podcast Episode, Podcast.__init__ Episode)
- Prefect (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
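For a sense of what sits downstream of a managed change data capture pipeline like the one described here, this sketch consumes row-level change events from Kafka with the kafka-python client. It assumes a Debezium-style event envelope (before/after/op) with the schema wrapper disabled; the topic name and broker address are placeholders.

```python
# Sketch of consuming change-data-capture events from Kafka. Topic and broker
# are placeholders; the event shape follows the Debezium convention of
# before/after/op fields (assuming JSON without the schema wrapper).
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "dbserver.public.users",            # one topic per captured table
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    event = message.value
    op = event.get("op")                # c = create, u = update, d = delete
    if op in ("c", "u"):
        print("upsert into warehouse:", event["after"])   # new row state
    elif op == "d":
        print("delete from warehouse:", event["before"])  # old row state
```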
Sep 29, 2020 • 60min

Speed Up And Simplify Your Streaming Data Workloads With Red Panda

Summary

Kafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread popularity, there are numerous accounts of the difficulty that operators face in keeping it reliable and performant, or trying to scale an installation. To make the benefits of the Kafka ecosystem more accessible and reduce the operational burden, Alexander Gallego and his team at Vectorized created the Red Panda engine. In this episode he explains how they engineered a drop-in replacement for Kafka, replicating the numerous APIs, that can scale more easily and deliver consistently low latencies with a much lower hardware footprint. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces. This was a fascinating conversation with an energetic and enthusiastic engineer and founder about the challenges and opportunities in the realm of streaming data. (A brief sketch showing a stock Kafka client pointed at a Red Panda broker follows the links below.)

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- If you’re looking for a way to optimize your data engineering pipeline – with instant query performance – look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of Big Data from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premise, with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering, connectors for all major BI tools and data sources, Qubz allows users to query OLAP cubes with sub-second response times on hundreds of billions of rows. To learn more, and sign up for a free demo, visit dataengineeringpodcast.com/qubz.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Alexander Gallego about his work at Vectorized building Red Panda as a performance optimized, drop-in replacement for Kafka

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Red Panda is and what motivated you to create it?
- What are the limitations of Kafka that make something like Red Panda necessary? What are the current strengths of the Kafka ecosystem that make it a reasonable implementation target for Red Panda?
- How is Red Panda architected? How has the design or direction changed or evolved since you first began working on it?
- What are the challenges that you face in automatically optimizing the runtime to take advantage of the hardware that it is deployed on? How do cloud environments contribute to that complexity?
- How are you handling the compatibility layer for the Kafka API? What is your approach for managing versioning and ensuring that you maintain bug compatibility?
- Beyond performance, what other areas of innovation or improvement in the capabilities and experience do you see while adhering to the Kafka protocol?
- What are the opportunities for innovation in the streaming space that aren’t being explored yet?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Red Panda being used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while building Red Panda and Vectorized?
- When is Red Panda the wrong choice?
- What do you have planned for the future of the product and business?
- What is your Hack The Planet diversity scholarship?

Contact Info

- @emaxerrno on Twitter
- LinkedIn

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

- Vectorized
- Free Download Trial
- @vectorizedio Company Twitter Account
- Community Slack
- Concord (alternative to Flink)
- Apache Flink (Podcast Episode)
- FAANG == Facebook, Apple, Amazon, Netflix, and Google
- Backblaze
- Raft
- NATS
- Pulsar (Podcast Episode)
- StreamNative (Podcast Episode)
- Open Messaging Specification
- ScyllaDB
- CockroachDB
- MemSQL
- WASM == Web Assembly
- Debezium (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
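The drop-in compatibility discussed in the episode is easy to illustrate: because Red Panda speaks the Kafka wire protocol, an ordinary Kafka client works unchanged, with only the bootstrap address (a placeholder below) pointed at a Red Panda broker.

```python
# A stock Kafka client publishing to Red Panda: no code changes, only the
# broker address differs (placeholder shown here).
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="redpanda.example.com:9092")
producer.send("page-views", key=b"user-42", value=b'{"path": "/pricing"}')
producer.flush()  # block until the broker acknowledges the writes
```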
Sep 22, 2020 • 48min

Cutting Through The Noise And Focusing On The Fundamentals Of Data Engineering With The Data Janitor

Summary

Data engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which leads to a great deal of confusion for newcomers. Daniel Molnar has dedicated his time to helping data professionals get back to basics through presentations at conferences and meetups, and with his most recent endeavor of building the Pipeline Data Engineering Academy. In this episode he shares advice on how to cut through the noise, which principles are foundational to building a successful career as a data engineer, and his approach to educating the next generation of data practitioners. This was a useful conversation for anyone working with data who has found themselves spending too much time chasing the latest trends and wishes to develop a more focused approach to their work.

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Daniel Molnar about being a data janitor and how to cut through the hype to understand what to learn for the long run

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you start by describing your thoughts on the current state of the data management industry?
- What is your strategy for being effective in the face of so much complexity and conflicting needs for data?
- What are some of the common difficulties that you see data engineers contend with, whether technical or social/organizational?
- What are the core fundamentals that you think are necessary for data engineers to be effective?
- What are the gaps in knowledge or experience that you have seen data engineers contend with?
- You recently started down the path of building a bootcamp for training data engineers. What was your motivation for embarking on that journey? How would you characterize your particular approach?
- What are some of the reasons that your applicants have for wanting to become versed in data engineering?
- What is the baseline of capabilities that you expect of your target audience? What level of proficiency do you aim for when someone has completed your training program? Who do you think would not be a good fit for your academy?
- As a hiring manager, what are the core capabilities that you look for in a data engineering candidate? What are some of the methods that you use to assess competence?
- What are the overall trends in the data management space that you are worried by? Which ones are you happy about?
- What are your plans and overall goals for the pipeline academy?

Contact Info

- LinkedIn
- @soobrosa on Twitter
- Website

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

- Pipeline Data Engineering Academy
- Data Janitor 101
- The Data Janitor Returns
- Berlin, Germany
- Hungary
- Urchin (Google Analytics precursor)
- AWS Redshift
- Nassim Nicholas Taleb
- Black Swans (affiliate link)
- KISS == Keep It Simple Stupid
- Dan McKinley
- Ralph Kimball Data Warehousing design
- Falsehoods Programmers Believe
- Apache Kafka
- AWS Kinesis
- ETL/ELT
- CI/CD
- Telemetry
- Depeche Mode
- Designing Data Intensive Applications (affiliate link)
- Stop Hiring DevOps Engineers and Start Growing Them
- T Shaped Engineer
- Pipeline Data Engineering Academy Curriculum
- MPP == Massively Parallel Processing
- Apache Flink (Podcast Episode)
- Flask web framework
- YAGNI == You Ain’t Gonna Need It
- Pair Programming
- Clojure

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
Sep 15, 2020 • 44min

Distributed In Memory Processing And Streaming With Hazelcast

Summary

In memory computing provides significant performance benefits, but brings along challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity hardware. On top of this foundation, the Hazelcast team has also built a streaming platform for reliable high throughput data transmission. In this episode Dale Kim shares how Hazelcast is implemented, the use cases that it enables, and how it complements on-disk data management systems. (A small sketch of working with a distributed in-memory map follows the links below.)

Announcements

- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Tree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema you can create your data catalog and have it fully populated in under five minutes when using one of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to dataengineeringpodcast.com/treeschema today to get your first month free, and mention this podcast to get 50% off your first three months after the trial.
- You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
- Your host is Tobias Macey and today I’m interviewing Dale Kim about Hazelcast, a distributed in-memory computing platform for data intensive applications

Interview

- Introduction
- How did you get involved in the area of data management?
- Can you start by describing what Hazelcast is and its origins?
- What are the benefits and tradeoffs of in-memory computation for data-intensive workloads?
- What are some of the common use cases for the Hazelcast in memory grid?
- How is Hazelcast implemented? How has the architecture evolved since it was first created?
- How is the Jet streaming framework architected? What was the motivation for building it?
- How do the capabilities of Jet compare to systems such as Flink or Spark Streaming?
- How has the introduction of hardware capabilities such as NVMe drives influenced the market for in-memory systems?
- How is the governance of the open source grid and Jet projects handled? What is the guiding heuristic for which capabilities or features to include in the open source projects vs. the commercial offerings?
- What is involved in building an application or workflow on top of Hazelcast? What are the common patterns for engineers who are building on top of Hazelcast?
- What is involved in deploying and maintaining an installation of the Hazelcast grid or Jet streaming?
- What are the scaling factors for Hazelcast? What are the edge cases that users should be aware of?
- What are some of the most interesting, innovative, or unexpected ways that you have seen Hazelcast used?
- When is Hazelcast Grid or Jet the wrong choice?
- What is in store for the future of Hazelcast?

Contact Info

- LinkedIn

Parting Question

- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links

- Hazelcast
- Istanbul
- Apache Spark
- OrientDB
- CAP Theorem
- NVMe
- Memristors
- Intel Optane Persistent Memory
- Hazelcast Jet
- Kappa Architecture
- IBM Cloud Paks
- Digital Integration Hub (Gartner)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
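As a small sketch of the in-memory data grid model, here is the Hazelcast Python client putting and getting entries in a distributed map, which lives across the cluster members rather than in the local process. The cluster address is a placeholder; check the client documentation for current configuration options.

```python
# Small sketch of the distributed map model using the Hazelcast Python
# client. The cluster address is a placeholder; values are stored as JSON
# strings to sidestep serialization configuration.
import json
import hazelcast

client = hazelcast.HazelcastClient(cluster_members=["10.0.0.5:5701"])
sessions = client.get_map("user-sessions").blocking()

# Entries are partitioned and replicated across the cluster, not held locally
sessions.put("user-42", json.dumps({"cart": ["sku-123"]}))
print(json.loads(sessions.get("user-42")))

client.shutdown()
```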
Sep 7, 2020 • 54min

Simplify Your Data Architecture With The Presto Distributed SQL Engine

Summary
Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations, which frequently requires cumbersome and time-consuming data integration. To address this problem Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. In this episode he explains how it is designed to allow for querying and combining data where it resides, the use cases that such an architecture unlocks, and the innovative ways that it is being employed at companies across the world. If you need to work with data in your cloud data lake, your on-premises database, or a collection of flat files, then give this episode a listen and then try out Presto today.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place

Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what Presto is and its origin story?
What was the motivation for releasing Presto as open source?
For someone who is responsible for architecting their organization’s data platform, what are some of the signals that Presto will be a good fit for them?
What are the primary ways that Presto is being used?
I interviewed your colleague at Starburst, Kamil, two years ago. How has Presto changed or evolved in that time, both technically and in terms of community and ecosystem growth?
What are some of the deployment and scaling considerations that operators of Presto should be aware of?
What are the best practices that have been established for working with data through Presto in terms of centralizing in a data lake vs. federating across disparate storage locations?
What are the tradeoffs of using Presto on top of a data lake vs. a vertically integrated warehouse solution?
When designing the layout of a data lake that will be interacted with via Presto, what are some of the data modeling considerations that can improve the odds of success?
What are some of the most interesting, unexpected, or innovative ways that you have seen Presto used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building, growing, and supporting the Presto project?
When is Presto the wrong choice?
What is in store for the future of the Presto project and community?

Contact Info
LinkedIn
@mtraverso on Twitter
martint on GitHub

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Presto
Starburst Data (Podcast Episode)
Hadoop
Hive
Glue Metastore
BigQuery
Kinesis
Apache Pinot
Elasticsearch
ORC
Parquet
AWS Redshift
Avro (Podcast Episode)
LZ4
Zstandard
KafkaSQL
Flink (Podcast Episode)
PyTorch (Podcast.__init__ Episode)
Tensorflow
Spark

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast
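Since the interview centers on querying and joining data where it lives, a hedged sketch of what that looks like from client code may be useful. This example is not from the episode; it uses the Presto JDBC driver, and the coordinator address, user, and the hive/mysql catalog, schema, and table names are all assumptions invented for illustration:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoFederationSketch {
    public static void main(String[] args) throws Exception {
        // JDBC URL format: jdbc:presto://host:port/catalog/schema
        // (placeholder coordinator address and user; no password configured)
        String url = "jdbc:presto://presto.example.com:8080/hive/default";
        try (Connection conn = DriverManager.getConnection(url, "analyst", null);
             Statement stmt = conn.createStatement();
             // A single statement joins a Hive table in the data lake with a
             // table in an operational MySQL database; Presto reads each source
             // through its connector at query time, with no prior data movement.
             ResultSet rs = stmt.executeQuery(
                 "SELECT c.name, count(*) AS order_count " +
                 "FROM hive.sales.orders o " +
                 "JOIN mysql.crm.customers c ON o.customer_id = c.id " +
                 "GROUP BY c.name")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + ": " + rs.getLong("order_count"));
            }
        }
    }
}
```

The fully qualified catalog.schema.table names are what make the federation transparent: adding another source is a connector configuration change on the Presto cluster, not a change to the application code.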
