Data Engineering Podcast

Tobias Macey
Oct 8, 2021 • 44min

Make Your Business Metrics Reusable With Open Source Headless BI Using Metriql

Summary
The key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes. In this episode Burak Kabakcı shares the story behind the project, how you can use it to create your metrics definitions, and the benefits of treating the semantic layer as a dedicated component of your platform.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
- Your host is Tobias Macey and today I’m interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Metriql is and the story behind it?
- What are the characteristics and benefits of a "headless BI" system?
- What was your motivation to create and open-source Metriql as an independent project outside of your business?
- How are you approaching governance and sustainability of the project?
- How does Metriql compare to projects such as AirBnB’s Minerva or Transform’s platform?
- How does the industry/vertical of a business impact their ability to benefit from a metrics layer/headless BI?
- What are the limitations to the logical complexity that can be applied to the calculation of a given metric/set of metrics?
- Can you describe how Metriql is implemented?
- How have the design and goals of the project changed or evolved since you began working on it?
- What are the most complex/difficult engineering elements of building a metrics layer?
- Can you describe the workflow of defining metrics?
- What have been your guiding principles in defining the user experience for working with Metriql?
- What are the opportunities for including business users in the definition of metrics? (e.g. pushing down/generating definitions from a BI layer)
- What are the biggest challenges and limitations of creating metrics definitions purely in SQL?
- What are the options for exposing metrics back to the warehouse and other operational systems such as reverse ETL vendors?
- What are the missing elements in the data ecosystem for taking full advantage of a headless BI/metrics layer?
- What are the most interesting, innovative, or unexpected ways that you have seen Metriql used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metriql?
- When is Metriql the wrong choice?
- What do you have planned for the future of Metriql?

Contact Info
- LinkedIn
- Website
- buremba on GitHub
- @bu7emba on Twitter

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Metriql Rakam Hazelcast Headless BI Google Data Studio Superset Podcast Episode Podcast.__init__ Episode Trino Podcast Episode Supergrain The Missing Piece Of The Modern Data Stack article by Benn Stancil Metabase Podcast Episode dbt Podcast Episode dbt-metabase re_data OpenMetadata

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
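
Metriql serves the metrics it manages over a Trino-compatible wire protocol, so any Trino client can query them without a bespoke integration. The following is a rough sketch of what that looks like from Python; the host, port, and the "orders" dataset with its "customer_segment" dimension and "total_revenue" measure are illustrative placeholders rather than definitions from the episode.

```python
# Sketch: querying a centrally defined Metriql metric through its
# Trino-compatible interface. Host, port, and the dataset/dimension/measure
# names are hypothetical placeholders; substitute your own deployment's
# definitions and verify the port against the metriql docs.
import trino  # pip install trino

conn = trino.dbapi.connect(
    host="localhost",
    port=5656,
    user="metriql",
    catalog="metriql",
)
cur = conn.cursor()
# The aggregation logic lives in the metric definition, so the query only
# names the dataset, dimension, and measure instead of repeating SQL.
cur.execute("SELECT customer_segment, total_revenue FROM orders GROUP BY 1")
for segment, revenue in cur.fetchall():
    print(segment, revenue)
```

This is the point of the headless approach: every downstream tool that speaks the protocol gets the same metric definition instead of re-implementing it.
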
Oct 6, 2021 • 46min

Adding Support For Distributed Transactions To The Redpanda Streaming Engine

Summary
Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is needed to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today!
- Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Denis Rystsov about implementing transactions in the Redpanda streaming engine

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you quickly recap what Redpanda is and the goals of the project?
- What are the use cases for transactions in a pub/sub messaging system?
- What are the elements of streaming systems that make atomic transactions a complex problem?
- What was the motivation for starting down the path of adding transactions to the Redpanda engine?
- How did the constraint of supporting the Kafka API influence your implementation strategy for transaction semantics?
- Can you talk through the details of how you ended up implementing transactions in Redpanda?
- What are some of the roadblocks and complexities that you encountered while working through the implementation?
- How did you approach the validation and verification of the transactions?
- What other features or capabilities are you planning to work on next?
- What are the most interesting, innovative, or unexpected ways that you have seen transactions in Redpanda used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on transactions for Redpanda?
- When are transactions the wrong choice?
- What do you have planned for the future of transaction support in Redpanda?

Contact Info
- @rystsov on Twitter
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Vectorized Redpanda Podcast Episode Redpanda Transactions Post Yandex Cassandra MongoDB Riak Cosmos DB Jepsen Podcast Episode Testing Shared Memories paper Journal of Systems Research Kafka Pulsar Seastar Framework CockroachDB Podcast Episode TiDB Calvin Paper Polyjuice Paper Parallel Commit Chaos Testing Matchmaker Paxos Algorithm

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
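
Because Redpanda implements the Kafka protocol, its transactions are exercised through the standard Kafka client APIs rather than anything Redpanda-specific. A minimal sketch with the confluent-kafka Python client, using placeholder broker addresses and topic names:

```python
# Sketch: an atomic multi-topic write against a Redpanda broker via the
# standard Kafka transactions API. Broker address, transactional.id, and
# topic names are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    # A stable transactional.id lets the broker fence zombie producers.
    "transactional.id": "order-pipeline-1",
})

producer.init_transactions()
producer.begin_transaction()
try:
    # Both records become visible atomically, or not at all.
    producer.produce("orders", key=b"o-123", value=b'{"status": "paid"}')
    producer.produce("audit-log", key=b"o-123", value=b'{"event": "payment"}')
    producer.commit_transaction()
except Exception:
    producer.abort_transaction()
    raise
```

Consumers configured with isolation.level set to read_committed will observe either both records or neither, which is the atomicity guarantee the episode digs into.
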
Oct 2, 2021 • 1h 8min

Building Real-Time Data Platforms For Large Volumes Of Information With Aerospike

Summary
Aerospike is a database engine that is designed to provide millisecond response times for queries across terabytes or petabytes. In this episode Chief Strategy Officer, Lenley Hensarling, explains how the ability to process these large volumes of information in real-time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system. If you need to deal with massive data, at high velocities, in milliseconds, then Aerospike is definitely worth learning about.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold’s proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
- Your host is Tobias Macey and today I’m interviewing Lenley Hensarling about Aerospike and building real-time data platforms

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Aerospike is and the story behind it?
- What are the use cases that it is uniquely well suited for?
- What are the use cases that you and the Aerospike team are focusing on and how does that influence your focus on priorities of feature development and user experience?
- What are the driving factors for building a real-time data platform?
- How is Aerospike being incorporated in application and data architectures?
- Can you describe how the Aerospike engine is architected?
- How have the design and architecture changed or evolved since it was first created?
- How have market forces influenced the product priorities and focus?
- What are the challenges that end users face when determining how to model their data given a key/value storage interface?
- What are the abstraction layers that you and/or your users build to manage relational or hierarchical data architectures?
- What are the operational characteristics of the Aerospike system? (e.g. deployment, scaling, CP vs AP, upgrades, clustering, etc.)
- What are the most interesting, innovative, or unexpected ways that you have seen Aerospike used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aerospike?
- When is Aerospike the wrong choice?
- What do you have planned for the future of Aerospike?

Contact Info
- LinkedIn

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Aerospike GitHub EnterpriseDB "Nobody Expects The Spanish Inquisition" ARM CPU Architectures AWS Graviton Processors The Datacenter Is The Computer (Affiliate link) Jepsen Tests Podcast Episode Cloud Native Computing Foundation Prometheus Grafana OpenTelemetry Podcast.__init__ Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
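
For readers unfamiliar with the key/value interface that the modeling questions above refer to, here is a minimal sketch with the official Aerospike Python client. The cluster address, namespace, set, and bin names are placeholders.

```python
# Sketch: basic record operations against an Aerospike cluster. The host,
# namespace ("test"), set ("users"), and bin names are placeholders.
import aerospike  # pip install aerospike

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# A key is (namespace, set, user key); a record is a dict of named "bins".
key = ("test", "users", "user:1001")
client.put(key, {"name": "Ada", "visits": 1, "segments": ["beta", "emea"]})

# Reads return (key, metadata, bins); metadata carries the record's
# generation counter and TTL, which matter for concurrency and expiry.
(_, meta, bins) = client.get(key)
print(meta["gen"], bins["visits"])

client.close()
```

The flat bins-per-key shape is what pushes users toward the abstraction layers discussed in the interview when they need relational or hierarchical structures.
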
Sep 30, 2021 • 1h 12min

Delivering Your Personal Data Cloud With Prifina

Summary
The promise of online services is that they will make your life easier in exchange for collecting data about you. The reality is that they use more information than you realize for purposes that are not what you intended. There have been many attempts to harness all of the data that you generate for gaining useful insights about yourself, but they are generally difficult to set up and manage or require software development experience. The team at Prifina have built a platform that allows users to create their own personal data cloud and install applications built by developers that power useful experiences while keeping you in full control. In this episode Markus Lampinen shares the goals and vision of the company, the technical aspects of making it a reality, and the future vision for how services can be designed to respect users’ privacy while still providing compelling experiences.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today!
- Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
- Your host is Tobias Macey and today I’m interviewing Markus Lampinen about Prifina, a platform for building applications powered by personal data that is under the user’s control

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Prifina is and the story behind it?
- What are the primary goals of Prifina?
- There has been a lot of interest in the "quantified self" and different projects (many that are open source) which aim to aggregate all of a user’s data into a single system for analysis and integration. What was lacking in the ecosystem that makes Prifina necessary/valuable?
- What are some of the personalized applications for this data that have been most compelling or that users are most interested in?
- What are the sources of complexity that you are facing when managing access/privacy of user’s data?
- Can you describe the architecture of the platform that you are building?
- What are the technological/social/economic underpinnings that are necessary to make a platform like Prifina possible?
- What are the assumptions that you had when you first became involved in the project which have been challenged or invalidated as you worked through the implementation and began engaging with users and developers?
- How do you approach schema definition/management for developers to have a stable implementation target?
- How has that schema evolved as you introduced new data sources?
- What are the barriers that you and your users have to deal with when obtaining copies of their data for use with Prifina?
- What are the potential threats that you anticipate for users gaining and maintaining control of their own data?
- What are the untapped opportunities?
- What are the topics where you have had to invest the most in user education?
- What are the most interesting, innovative, or unexpected ways that you have seen Prifina used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Prifina?
- When is Prifina the wrong choice?
- What do you have planned for the future of Prifina?

Contact Info
- LinkedIn
- @mmlampinen on Twitter
- mmlampinen on GitHub

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Prifina Google Takeout

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
Sep 26, 2021 • 58min

Digging Into Data Reliability Engineering

Summary
The accuracy and availability of data has become critically important to the day-to-day operation of businesses. Similar to the practice of site reliability engineering as a means of ensuring consistent uptime of web services, there has been a new trend of building data reliability engineering practices in companies that rely heavily on their data. In this episode Egor Gryaznov explains how this practice manifests from a technical and organizational perspective and how you can start adopting it in your own teams.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
- Your host is Tobias Macey and today I’m interviewing Egor Gryaznov, co-founder and CTO of Bigeye, about the ideas and practices of data reliability engineering and how to integrate it into your systems

Interview
- Introduction
- How did you get involved in the area of data management?
- What does the term "Data Reliability Engineering" mean?
- What is encompassed under the umbrella of Data Reliability Engineering?
- How does it compare to the concepts from site reliability engineering?
- Is DRE just a repackaged version of DataOps?
- Why is Data Reliability Engineering particularly important now?
- Who is responsible for the practice of DRE in an organization?
- What are some areas of innovation that teams are focusing on to support a DRE practice?
- What are the tools that teams are using to improve the reliability of their data operations?
- What are the organizational systems that need to be in place to support a DRE practice?
- What are some potential roadblocks that teams might have to address when planning and implementing a DRE strategy?
- What are the most interesting, innovative, or unexpected approaches/solutions to DRE that you have seen?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Reliability Engineering?
- Is Data Reliability Engineering ever the wrong choice?
- What do you have planned for the future of Bigeye, especially in terms of Data Reliability Engineering?

Contact Info
- Find us at bigeye.com or reach out to us at hello@bigeye.com
- You can find Egor on LinkedIn or email him at egor@bigeye.com

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Bigeye Podcast Episode Vertica Looker Podcast Episode Site Reliability Engineering Stemma Podcast Episode Collibra Podcast Episode OpenLineage Podcast Episode

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
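
To make the practice concrete: a data reliability check is typically a codified, continuously evaluated assertion about the data, analogous to an SRE uptime probe. A minimal, generic sketch of a freshness SLO check follows; the table and column names are hypothetical, and "conn" stands in for any DB-API connection to your warehouse.

```python
# Generic sketch of a data reliability freshness check. Table and column
# names are hypothetical; the driver is assumed to return timezone-aware
# datetimes for the timestamp column.
from datetime import datetime, timedelta, timezone

def assert_fresh(conn, table: str, ts_column: str, max_lag: timedelta) -> None:
    """Raise when the newest row in `table` is older than the agreed SLO."""
    cur = conn.cursor()
    cur.execute(f"SELECT MAX({ts_column}) FROM {table}")
    newest = cur.fetchone()[0]
    lag = datetime.now(timezone.utc) - newest
    if lag > max_lag:
        # In a DRE practice this would page the owning team, not just fail.
        raise RuntimeError(f"{table} is stale: newest row is {lag} old (SLO {max_lag})")
```

Tools like Bigeye automate generating, scheduling, and alerting on checks of this kind so teams don’t maintain them by hand.
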
Sep 25, 2021 • 1h 4min

Massively Parallel Data Processing In Python Without The Effort Using Bodo

Summary
Python has become the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information. There have been many projects and strategies for overcoming these challenges, each with their own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any re-work.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today!
- Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
- Your host is Tobias Macey and today I’m interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing Python code for massively parallel data processing and analytics

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Bodo is and the story behind it?
- What are the techniques/technologies that teams might use to optimize or scale out their data processing workflows?
- Why have you focused your efforts on the Python language and toolchain?
- Do you see any potential for expanding into other language communities?
- What are the shortcomings of projects such as Dask and Ray for scaling out Python data projects?
- Many people are familiar with the principle of HPC architectures, but can you share an overview of the current state of the art for HPC?
- What are the tradeoffs of HPC vs scale-out distributed systems?
- Can you describe the technical implementation of the Bodo platform?
- What are the aspects of the Python language and package ecosystem that have complicated the work of building an optimizing compiler?
- How do you handle compiled extensions? (e.g. C/C++/Fortran)
- What are some of the assumptions/expectations that you had when first approaching this project that have been challenged as you progressed through its implementation?
- How do you handle data distribution for scale out computation?
- What are some software architecture/programming patterns that act as bottlenecks/optimization cliffs for parallelization?
- What are some of the educational challenges that you have run into while working with potential and current customers?
- What are the most interesting, innovative, or unexpected ways that you have seen Bodo used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bodo?
- When is Bodo the wrong choice?
- What do you have planned for the future of Bodo?

Contact Info
- LinkedIn
- @EhsanTn on Twitter

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Bodo High Performance Computing (HPC) University of Illinois, Urbana-Champaign Julia Language Pandas Podcast.__init__ Episode NumPy Dask Podcast Episode Ray Podcast.__init__ Episode Numba LLVM SPMD MPI Elastic Fabric Adapter Iceberg Table Format Podcast Episode IPython Parallel

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
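
Bodo’s model, as described in the episode, is to compile ordinary pandas code rather than asking for a rewrite. A minimal sketch of that workflow follows; the Parquet path and column names are placeholders, and it assumes these pandas operations fall within Bodo’s supported subset.

```python
# Sketch: Bodo's JIT decorator applied to ordinary pandas code. The Parquet
# path and the order_ts/amount columns are hypothetical placeholders.
import bodo
import pandas as pd

@bodo.jit
def daily_revenue(path):
    df = pd.read_parquet(path)      # I/O is parallelized across processes
    df["day"] = df["order_ts"].dt.date
    return df.groupby("day")["amount"].sum()

if __name__ == "__main__":
    print(daily_revenue("s3://bucket/orders.parquet"))
```

The same script scales out by launching it under MPI in SPMD fashion, e.g. mpiexec -n 8 python script.py, which is the HPC execution model the interview contrasts with Dask- and Ray-style schedulers.
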
Sep 19, 2021 • 1h 12min

Declarative Machine Learning Without The Operational Overhead Using Continual

Summary
Building, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add the work of creating the model itself, and it’s not surprising that a majority of companies that could greatly benefit from machine learning have yet to either put it into production or see the value. Tristan Zajonc recognized the complexity that acts as a barrier to adoption and created the Continual platform in response. In this episode he shares his perspective on the benefits of declarative machine learning workflows as a means of accelerating adoption in businesses that don’t have the time, money, or ambition to build everything from scratch. He also discusses the technical underpinnings of what he is building and how using the data warehouse as a shared resource drastically shortens the time required to see value. This is a fascinating episode and Tristan’s work at Continual is likely to be the catalyst for a new stage in the machine learning community.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Tristan Zajonc about Continual, a platform for automating the creation and application of operational AI on top of your data warehouse

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you describe what Continual is and the story behind it?
- What is your definition for "operational AI" and how does it differ from other applications of ML/AI?
- What are some example use cases for AI in an operational capacity?
- What are the barriers to adoption for organizations that want to take advantage of predictive analytics?
- Who are the target users of Continual?
- Can you describe how the Continual platform is implemented?
- How has the design and infrastructure changed or evolved since you first began working on it?
- What is the workflow for someone building a model and putting it into production?
- Once a model has been deployed, what are the mechanisms that you expose for interacting with it?
- How does this differ from in-database ML capabilities such as what is offered by Vertica and BigQuery?
- How much understanding of ML/AI principles is necessary for someone to create a model with Continual?
- What is your estimation of the impact that Continual can have on the overall productivity of a data team/data scientist?
- What are the most interesting, innovative, or unexpected ways that you have seen Continual used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Continual?
- When is Continual the wrong choice?
- What do you have planned for the future of Continual?

Contact Info
- LinkedIn
- @tristanzajonc on Twitter

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
- Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
- Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
- If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
- To help other people find the show please leave a review on iTunes and tell your friends and co-workers
- Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Continual World Bank SAS SPSS Stata Feature Store DataRobot Transfer Learning dbt Podcast Episode Ludwig Overton (Apple) Hightouch Census Galaxy Schema In-Database ML Podcast Episode scikit-learn Snorkel Podcast Episode Materialize Podcast Episode Flink SQL

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
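
One of the questions above contrasts Continual with in-database ML such as BigQuery ML. For reference, that warehouse-native baseline looks roughly like the sketch below, driven from Python; the dataset, table, and column names are hypothetical, and this illustrates the in-database pattern being contrasted, not Continual’s own API.

```python
# Sketch of the in-database ML pattern: training and scoring happen inside
# the warehouse via SQL, here with BigQuery ML. demo.customers and its
# columns are hypothetical placeholders.
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Train a model from a table of labeled examples, entirely in-warehouse.
client.query("""
    CREATE OR REPLACE MODEL demo.churn_model
    OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
    SELECT tenure_days, monthly_spend, support_tickets, churned
    FROM demo.customers
""").result()

# Score rows with the trained model, again as a SQL query.
rows = client.query("""
    SELECT customer_id, predicted_churned
    FROM ML.PREDICT(MODEL demo.churn_model, TABLE demo.customers)
""").result()
for row in rows:
    print(row.customer_id, row.predicted_churned)
```

The declarative-ML pitch in the episode is to keep this warehouse-centric simplicity while adding the missing operational pieces such as continual retraining, feature management, and monitoring.
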
Sep 19, 2021 • 55min

An Exploration Of The Data Engineering Requirements For Bioinformatics

Summary
Biology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has grown up that pairs statistics and computational analysis with scientific research, namely bioinformatics. This brings with it a unique set of challenges for data collection, data management, and analytical capabilities. In this episode Jillian Rowe shares her experience of working in the field and supporting teams of scientists and analysts with the data infrastructure that they need to get their work done. This is a fascinating exploration of the collaboration between data professionals and scientists.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today!
- Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
- Your host is Tobias Macey and today I’m interviewing Jillian Rowe about data engineering practices for bioinformatics projects

Interview
- Introduction
- How did you get involved in the area of data management?
- How did you get into the field of bioinformatics?
- Can you describe what is unique about data needs in bioinformatics?
- What are some of the problems that you have found yourself regularly solving for your clients?
- When building data engineering stacks for bioinformatics, what are the attributes that you are optimizing for? (e.g. speed, UX, scale, correctness, etc.)
- Can you describe a typical set of technologies that you implement when working on a new project?
- What kinds of systems do you need to integrate with?
- What are the data formats that are widely used for bioinformatics?
- What are some details that a data engineer would need to know to work effectively with those formats while preparing data for analysis?
- What amount of domain expertise is necessary for a data engineer to work in life sciences?
- What are the most interesting, innovative, or unexpected solutions that you have seen for manipulating bioinformatics data?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on bioinformatics projects?
- What are some of the industry/academic trends or upcoming technologies that you are tracking for bioinformatics?

Contact Info
- LinkedIn
- jerowe on GitHub
- Website

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Bioinformatics How Perl Saved The Human Genome Project Neo4J AWS Parallel Cluster Datashader R Shiny Plotly Dash Apache Parquet Dask Podcast Episode HDF5 Spark Superset Data Engineering Podcast Episode Podcast.__init__ Episode FastQ file format BAM (Binary Alignment Map) File Variant Call Format (VCF) HIPAA DVC Podcast Episode LakeFS BioThings API

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
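
The links above call out FastQ, BAM, and VCF, the workhorse file formats of the field. As a brief orientation for data engineers, the sketch below reads each of them with pysam; the file paths are placeholders, and the region queries assume the usual .bai/.tbi index files exist alongside the data.

```python
# Sketch: touching the three core bioinformatics formats with pysam.
# All file paths are hypothetical placeholders.
import pysam  # pip install pysam

# FastQ: raw sequencer reads with per-base quality scores.
with pysam.FastxFile("sample.fastq") as fq:
    for read in fq:
        print(read.name, len(read.sequence))

# BAM: reads aligned to a reference genome, queryable by region
# (requires the .bai index).
with pysam.AlignmentFile("sample.bam", "rb") as bam:
    n_reads = bam.count("chr1", 100_000, 200_000)
    print(n_reads)

# VCF: variant calls relative to the reference (requires the .tbi index
# for region fetches on the bgzipped file).
with pysam.VariantFile("sample.vcf.gz") as vcf:
    for rec in vcf.fetch("chr1", 100_000, 200_000):
        print(rec.pos, rec.ref, rec.alts)
```

The practical detail for pipeline work is that BAM and VCF are indexed, compressed binary formats, so region-based access patterns are cheap while full scans of multi-gigabyte files are not.
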
Sep 12, 2021 • 59min

Setting The Stage For The Next Chapter Of The Cassandra Database

Summary
The Cassandra database is one of the first open source options for globally scalable storage systems. Since its introduction in 2008 it has been powering systems at every scale. The community recently released a new major version that marks a milestone in its maturity and stability as a project and database. In this episode Ben Bromhead, CTO of Instaclustr, shares the challenges that the community has worked through, the work that went into the release, and how the stability and testing improvements are setting the stage for the future of the project.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Your host is Tobias Macey and today I’m interviewing Ben Bromhead about the recent release of Cassandra version 4 and how it fits in the current landscape of data tools

Interview
- Introduction
- How did you get involved in the area of data management?
- For anyone who isn’t familiar with Cassandra, can you briefly describe what it is and some of the story behind it?
- How did you get involved in the Cassandra project and how would you characterize your role?
- What are the main use cases and industries where someone is likely to use Cassandra?
- What is notable about the version 4 release?
- What were some of the factors that contributed to the long delay between versions 3 and 4? (2015 – 2021)
- What are your thoughts on the ongoing utility/benefits of projects such as ScyllaDB, particularly in light of the most recent release?
- Cassandra is primarily used as a system of record. What are some of the tools and system architectures that users turn to when building analytical workloads for data stored in Cassandra?
- The architecture of Cassandra has lent itself well to the cloud native ecosystem that has been growing in recent years. What do you see as the opportunities for Cassandra over the near to medium term as the cloud continues to grow in prominence?
- What are some of the challenges that you and the Cassandra community have faced with the flurry of new data storage and processing systems that have popped up over the past few years?
- What are the most interesting, innovative, or unexpected ways that you have seen Cassandra used?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cassandra?
- When is Cassandra the wrong choice?
- What is in store for the future of Cassandra?

Contact Info
- LinkedIn
- @benbromhead on Twitter
- benbromhead on GitHub

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Cassandra Instaclustr HBase DynamoDB Whitepaper Property Based Testing QuickTheories Riak FoundationDB Podcast Episode ScyllaDB Podcast Episode YugabyteDB Podcast Episode Azure Cosmos DB Amazon Keyspaces Netty Kafka CQRS == Command Query Responsibility Segregation Elasticsearch Redis Memcached Debezium Podcast Episode CDC == Change Data Capture Podcast Episodes Bigtable White Paper CockroachDB Podcast Episode Vitess CAP Theorem Paxos

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
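
For anyone new to Cassandra’s data model, the sketch below shows the basics with the DataStax Python driver: a replicated keyspace and a table whose partition key determines which replicas own each row. The contact points and schema are illustrative placeholders.

```python
# Sketch: basic CQL operations with the DataStax Python driver against a
# Cassandra 4 cluster. Contact points, keyspace, and table are placeholders.
from cassandra.cluster import Cluster  # pip install cassandra-driver

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")
# sensor_id is the partition key (controls data placement across nodes);
# ts is the clustering column (orders rows within a partition).
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.readings (
        sensor_id text, ts timestamp, value double,
        PRIMARY KEY (sensor_id, ts)
    ) WITH CLUSTERING ORDER BY (ts DESC)
""")
session.execute(
    "INSERT INTO demo.readings (sensor_id, ts, value) "
    "VALUES (%s, toTimestamp(now()), %s)",
    ("sensor-1", 21.5),
)
```

Queries that name the partition key stay on a small set of replicas, which is why schema design around access patterns dominates the modeling discussions referenced in the interview.
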
Sep 9, 2021 • 1h 4min

A View From The Round Table Of Gartner's Cool Vendors

Summary
Gartner analysts are tasked with identifying promising companies each year that are making an impact in their respective categories. For businesses working in the data management and analytics space, they recognized the efforts of Timbr.ai, Soda Data, Nexla, and Tada. In this episode the founders and leaders of each of these organizations share their perspective on the current state of the market, and the challenges facing businesses and data professionals today.

Announcements
- Hello and welcome to the Data Engineering Podcast, the show about modern data management
- When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
- Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
- Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
- Your host is Tobias Macey and today I’m interviewing Saket Saurabh, Maarten Masschelein, Akshay Deshpande, and Dan Weitzner about the challenges facing data practitioners today and the solutions that are being brought to market for addressing them, as well as the work they are doing that got them recognized as "cool vendors" by Gartner.

Interview
- Introduction
- How did you get involved in the area of data management?
- Can you each describe what you view as the biggest challenge facing data professionals?
- Who are you building your solutions for and what are the most common data management problems you are all solving?
- What are the different components of data management and why is it so complex? What, if anything, will simplify this process?
- The report covers a lot of new data management terminology – data governance, data observability, data fabric, data mesh, DataOps, MLOps, AIOps – what does this all mean and why is it important for data engineers?
- How has the data management space changed in recent times? Describe the current data management landscape and any key developments.
- From your perspective, what are the biggest challenges in the data management space today?
- What modern data management features are lacking in existing databases?
- Gartner imagines a future where data and analytics leaders need to be prepared to rely on data management solutions that make heterogeneous, distributed data appear consolidated, easy to access and business friendly. How does this tally with your vision of the future of data management and what needs to happen to make this a reality?
- What are the most interesting, innovative, or unexpected ways that you have seen your respective products used (in isolation or combined)?
- What are the most interesting, unexpected, or challenging lessons that you have learned while working on your respective platforms?
- What are the upcoming trends and challenges that you are keeping a close eye on?

Contact Info
- Saket: LinkedIn, @saketsaurabh on Twitter
- Maarten: LinkedIn, @masscheleinm on Twitter
- Dan: LinkedIn
- Akshay: Website

Parting Question
- From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
Nexla Soda Tada Timbr Collibra Podcast Episode Gartner Cool Vendors

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
