
Data Engineering Podcast

Latest episodes

Oct 29, 2021 • 1h 9min

Removing The Barrier To Exploratory Analytics with Activity Schema and Narrator

Summary

The perennial question of data warehousing is how to model the information that you are storing. This has given rise to methods as varied as star and snowflake schemas, data vault modeling, and wide tables. The challenge with many of those approaches is that they are optimized for answering known questions but brittle and cumbersome when exploring unknowns. In this episode Ahmed Elsamadisi shares his journey to find a more flexible and universal data model in the form of the "activity schema" that is powering the Narrator platform, and how it has allowed his customers to perform self-service exploration of their business domains without being blocked by schema evolution in the data warehouse. This is a fascinating exploration of what can be done when you challenge your assumptions about what is possible.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.

Your host is Tobias Macey and today I’m interviewing Ahmed Elsamadisi about Narrator, a platform to enable anyone to go from question to data-driven decision in minutes.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Narrator is and the story behind it?
What are the challenges that you have seen organizations encounter when attempting to make analytics a self-serve capability?
What are the use cases that you are focused on?
How does Narrator fit within the data workflows of an organization?
How is the Narrator platform implemented?
How has the design and focus of the technology evolved since you first started working on Narrator?
The core element of the analyses that you are building is the "activity schema". Can you describe the design process that led you to that format?
What are the challenges that are posed by more widely used modeling techniques such as star/snowflake or data vault?
How does the activity schema address those challenges?
What are the performance characteristics of deriving models from an activity schema/timeseries table?
For someone who wants to use Narrator, what is involved in transforming their data to map into the activity schema?
Can you talk through the domain modeling that needs to happen when determining what entities and actions to capture?
What are the most interesting, innovative, or unexpected ways that you have seen Narrator used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Narrator?
When is Narrator the wrong choice?
What do you have planned for the future of Narrator?

Contact Info

LinkedIn
@ae4ai on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Narrator
DARPA Challenge
Fivetran
Luigi
Chartio
Airflow
Domain Driven Design
Data Vault
Snowflake Schema
Event Sourcing
Census (Podcast Episode)
Hightouch (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Support Data Engineering Podcast
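The "activity schema" that anchors this conversation collapses warehouse modeling into a single time-series table of customers performing activities. As a rough companion sketch (not Narrator's implementation; the column names follow the published activity schema spec, with DuckDB standing in for a cloud warehouse), here is the shape of that table and the kind of ad-hoc temporal question it makes cheap to ask:

```python
# A minimal sketch of an activity schema "activity stream", assuming DuckDB as
# a stand-in warehouse. Real deployments add more feature columns and
# occurrence metadata.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE activity_stream (
        activity_id    VARCHAR,    -- unique id for this activity record
        ts             TIMESTAMP,  -- when the activity happened
        customer       VARCHAR,    -- the entity that performed it
        activity       VARCHAR,    -- e.g. 'viewed_page', 'completed_order'
        feature_1      VARCHAR,    -- loosely typed metadata slots
        feature_2      VARCHAR,
        feature_3      VARCHAR,
        revenue_impact DOUBLE,
        link           VARCHAR
    )
""")
con.execute("""
    INSERT INTO activity_stream VALUES
    ('1', '2021-10-01 09:00:00', 'cust_42', 'viewed_page', '/pricing', NULL, NULL, NULL, NULL),
    ('2', '2021-10-01 09:40:00', 'cust_42', 'completed_order', 'sku_7', NULL, NULL, 49.0, NULL)
""")
# An exploratory question needs no new model, just a temporal self-join:
# which orders happened within an hour of a pricing-page view?
print(con.execute("""
    SELECT o.customer, o.ts, o.revenue_impact
    FROM activity_stream AS o
    JOIN activity_stream AS v
      ON v.customer = o.customer
     AND v.activity = 'viewed_page'
     AND o.ts BETWEEN v.ts AND v.ts + INTERVAL 1 HOUR
    WHERE o.activity = 'completed_order'
""").fetchall())
```

Because every question is phrased against the same table, schema evolution in the source systems only affects the transformations that feed the stream, not the analyses built on top of it.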
Oct 29, 2021 • 1h 10min

Streaming Data Pipelines Made SQL With Decodable

Summary

Streaming data systems have been growing more capable and flexible over the past few years. Despite this, it is still challenging to build reliable pipelines for stream processing. In this episode Eric Sammer discusses the shortcomings of the current set of streaming engines and how they force engineers to work at an extremely low level of abstraction. He also explains why he started Decodable to address that limitation and the work that he and his team have done to let data engineers build streaming pipelines entirely in SQL.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today!

Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!

Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.

Your host is Tobias Macey and today I’m interviewing Eric Sammer about Decodable, a platform for simplifying the work of building real-time data pipelines.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Decodable is and the story behind it?
Who are the target users, and how has that focus informed your prioritization of features at launch?
What are the complexities that data engineers encounter when building pipelines on streaming systems?
What are the distributed systems concepts and design optimizations that are often skipped over or misunderstood by engineers who are using them? (e.g. backpressure, exactly-once semantics, isolation levels, etc.)
How do those mismatches in understanding and expectation impact the correctness and reliability of the workflows that they are building?
Can you describe how you have architected the Decodable platform?
What have been the most complex or time consuming engineering challenges that you have dealt with so far?
What are the points of integration that you expose for engineers to wire in their existing infrastructure and data systems?
What has been your process for designing the interfaces and abstractions that you are exposing to end users?
What are some of the leaks in those abstractions that have either started to show or are anticipated?
What have you learned about the state of data engineering and the costs and benefits of real-time data while working on Decodable?
What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
When is Decodable the wrong choice?
What do you have planned for the future of Decodable?

Contact Info

esammer on GitHub
@esammer on Twitter
LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Decodable
Cloudera
Kafka
Flink (Podcast Episode)
Spark
Snowflake (Podcast Episode)
BigQuery
RedShift
kSQLDB (Podcast Episode)
dbt (Podcast Episode)
Millwheel Paper
Dremel Paper
Timely Dataflow
Materialize (Podcast Episode)
Software Defined Networking
Data Mesh (Podcast Episode)
OpenLineage (Podcast Episode)
DataHub (Podcast Episode)
Amundsen (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Support Data Engineering Podcast
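Decodable’s product interface isn’t reproduced here, but the episode’s central idea, expressing an entire streaming pipeline in SQL rather than low-level stream APIs, can be sketched with open source Flink (which the conversation references) via PyFlink. The connectors and schema below are illustrative placeholders, not Decodable configuration:

```python
# A self-contained sketch of an all-SQL streaming pipeline using PyFlink:
# a generated click stream aggregated into one-minute tumbling windows.
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# Source: synthetic events, with a watermark so windows can close.
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id STRING,
        ts TIMESTAMP(3),
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH ('connector' = 'datagen', 'rows-per-second' = '10')
""")

# Sink: print to stdout; a real pipeline would target Kafka, a warehouse, etc.
t_env.execute_sql("""
    CREATE TABLE click_counts (
        window_start TIMESTAMP(3),
        window_end   TIMESTAMP(3),
        clicks       BIGINT
    ) WITH ('connector' = 'print')
""")

# The pipeline itself is a single INSERT ... SELECT over a window function.
t_env.execute_sql("""
    INSERT INTO click_counts
    SELECT window_start, window_end, COUNT(*) AS clicks
    FROM TABLE(TUMBLE(TABLE clicks, DESCRIPTOR(ts), INTERVAL '1' MINUTE))
    GROUP BY window_start, window_end
""").wait()
```

The appeal of this level of abstraction is that backpressure, checkpointing, and state management stay inside the engine instead of in hand-written stream-processing code.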
Oct 23, 2021 • 1h 6min

Data Exploration For Business Users Powered By Analytics Engineering With Lightdash

Summary

The market for business intelligence has been going through an evolutionary shift in recent years. One of the driving forces for that change has been the rise of analytics engineering powered by dbt. Lightdash has fully embraced that shift by building an entire open source business intelligence framework that is powered by dbt models. In this episode Oliver Laslett describes why dashboards aren’t sufficient for business analytics, how Lightdash promotes the work that you are already doing in your data warehouse modeling with dbt, and how they are focusing on bridging the divide between data teams and business teams and the requirements that they have for data workflows.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.

Your host is Tobias Macey and today I’m interviewing Oliver Laslett about Lightdash, an open source business intelligence system powered by your dbt models.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Lightdash is and the story behind it?
What are the main goals of the project?
Who are the target users, and how has that profile informed your feature priorities?
Business intelligence is a market that has gone through several generational shifts, with products targeting numerous personas and purposes. What are the capabilities that make Lightdash stand out from the other options?
Can you describe how Lightdash is architected?
How have the design and goals of the system changed or evolved since you first began working on it?
What have been the most challenging engineering problems that you have dealt with?
How does the approach that you are taking with Lightdash compare to systems such as Transform and Metriql that aim to provide a dedicated metrics layer?
Can you describe the workflow for someone building an analysis in Lightdash?
What are the points of collaboration around Lightdash for different roles in the organization?
What are the methods that you use to expose information about the state of the underlying dbt models to the end users?
How do they use that information in their exploration and decision making?
What was your motivation for releasing Lightdash as open source?
How are you handling the governance and long-term viability of the project?
What are the most interesting, innovative, or unexpected ways that you have seen Lightdash used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Lightdash?
When is Lightdash the wrong choice?
What do you have planned for the future of Lightdash?

Contact Info

LinkedIn
owlas on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Lightdash
Looker (Podcast Episode)
PowerBI (Podcast Episode)
Redash (Podcast Episode)
Metabase (Podcast Episode)
dbt (Podcast Episode)
Superset (Podcast Episode)
Streamlit (Podcast Episode)
Kubernetes
JDBC
SQLAlchemy
SQLPad
Singer (Podcast Episode)
Airbyte (Podcast Episode)
Meltano (Podcast Episode)
Transform (Podcast Episode)
Metriql (Podcast Episode)
Cube.js
OpenLineage (Podcast Episode)
dbt Packages
Rudderstack
PostHog (Podcast Interview)
Firebolt (Podcast Interview)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Support Data Engineering Podcast
Oct 21, 2021 • 1h 9min

Completing The Feedback Loop Of Data Through Operational Analytics With Census

Summary

The focus of the past few years has been to consolidate all of the organization’s data into a cloud data warehouse. As a result there have been a number of trends in data that take advantage of the warehouse as a single focal point. Among those trends is the advent of operational analytics, which completes the cycle of data from collection, through analysis, to driving further action. In this episode Boris Jabes, CEO of Census, explains how the work of synchronizing cleaned and consolidated data about your customers back into the systems that you use to interact with those customers allows for a powerful feedback loop that has been missing in data systems until now. He also discusses how Census makes that synchronization easy to manage, how it fits with the growth of data quality tooling, and how you can start using it today.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today!

Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Your host is Tobias Macey and today I’m interviewing Boris Jabes about Census and the growing category of operational analytics.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Census is and the story behind it?
The terms "reverse ETL" and "operational analytics" have started being used for similar, and often interchangeable, purposes. What are your thoughts on the semantic and concrete differences between these phrases?
What are the motivating factors for adding operational analytics or "data activation" to an organization’s data platform?
This is a nascent but quickly growing market with a number of products and projects operating in the space. How would you characterize the current state of the segment and Census’ position in it?
Can you describe how the Census platform is implemented?
What are some of the early design choices that have had to be refactored or augmented as you have evolved the product and worked with customers?
What are some of the assumptions that you had about the needs and uses for the platform which have been challenged or changed as you dug deeper into the problem?
Can you describe the workflow for a customer adopting Census?
What are some of the data modeling practices that make it easier to "activate" the organization’s data?
Another recent trend in the data industry is the growth of data quality and data lineage tools. What is involved in using the measured quality or lineage information as a signal in the operational systems, or to prevent a synchronization?
How can users test and validate their workflows in Census?
What are the options for propagating Census’ runtime information back into lineage and data quality tracking?
Census supports incremental syncs from the warehouse. What are the opportunities for bringing streaming architectures to the space of operational analytics?
What are the challenges/complexities in the current set of technologies that act as a barrier?
What are the most interesting, innovative, or unexpected ways that you have seen Census used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Census?
When is Census the wrong choice?
What do you have planned for the future of Census?

Contact Info

LinkedIn
Website
@borisjabes on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Census
Operational Analytics
Fivetran (Podcast Episode)
dbt (Podcast Episode)
Snowflake (Podcast Episode)
Loom
Materialize (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Support Data Engineering Podcast
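To make the "operational analytics" loop concrete, here is a deliberately tiny sketch of what an incremental reverse ETL sync does under the hood. It is not Census’s implementation: the table, the watermark logic, and push_to_crm are hypothetical stand-ins, with DuckDB playing the warehouse:

```python
# A toy incremental sync: diff the warehouse model against a high-water mark
# and push only changed rows downstream. All names here are hypothetical.
from datetime import datetime
import duckdb

con = duckdb.connect()
con.execute("CREATE SCHEMA analytics")
con.execute("""
    CREATE TABLE analytics.customers AS
    SELECT * FROM (VALUES
        ('a@example.com', 120.0, TIMESTAMP '2021-10-01 09:00:00'),
        ('b@example.com',  80.0, TIMESTAMP '2021-10-02 09:00:00')
    ) t(email, lifetime_value, updated_at)
""")

def push_to_crm(email, fields):
    # Hypothetical stand-in for a SaaS API client (e.g. updating a CRM record).
    print("would update", email, "with", fields)

def sync_customers(last_synced_at):
    rows = con.execute(
        "SELECT email, lifetime_value, updated_at FROM analytics.customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        [last_synced_at],
    ).fetchall()
    for email, ltv, updated_at in rows:
        push_to_crm(email, {"lifetime_value": ltv})
        last_synced_at = updated_at  # advance the watermark as rows land
    return last_synced_at

# Only the row updated after the watermark gets pushed.
mark = sync_customers(datetime(2021, 10, 1, 12, 0))
```

The production concerns the episode digs into (retries, rate limits, schema mapping, lineage signals gating a sync) all live around this core loop.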
Oct 16, 2021 • 1h 8min

Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data

Summary

The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. It was also designed to be able to work for small scale systems that are just starting to develop in complexity. In order to support the project and make it even easier to use for organizations of every size, Shirshanka Das and Swaroop Jagadish founded Acryl Data. In this episode they discuss the recent work that has been done by the community, how their work is building on top of that foundation, and how you can get started with DataHub for your own work to manage data discovery today. They also share their ambitions for the near future of adding data observability and data quality management features.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

Your host is Tobias Macey and today I’m interviewing Shirshanka Das and Swaroop Jagadish about Acryl Data, the company driving the open source metadata project DataHub for powering data discovery, data observability and federated data governance.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Acryl Data is and the story behind it?
How has your experience of building and running DataHub at LinkedIn informed your product direction for Acryl?
What are some lessons that your co-founder Swaroop has contributed from his experience at AirBnB?
The data catalog/discovery/quality market has become very active over the past year. What is your perspective on the market, and what are the gaps that are not yet being addressed?
How does the focus of Acryl compare to what the team at Metaphor are building?
How has the DataHub project changed in the past year with more companies outside of LinkedIn using and contributing to it?
What are your plans for data observability?
Can you describe the system architecture that you have built at Acryl?
What are the convenience features that you are building to augment the capabilities and integration process for DataHub?
What are some typical workflows that data teams build out when working with Acryl?
What are some examples of automated actions that can be triggered from metadata changes?
What are the available events that can be used to trigger actions?
What are some of the challenges that teams are facing when integrating metadata management and analysis into their data workflows?
What are your thoughts on the potential for the OpenLineage and OpenMetadata projects?
How is the governance of DataHub being managed?
What are the most interesting, innovative, or unexpected ways that you have seen Acryl/DataHub used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Acryl/DataHub?
When is Acryl the wrong choice?
What do you have planned for the future of Acryl?

Contact Info

Shirshanka: LinkedIn, @shirshanka on Twitter, shirshanka on GitHub
Swaroop: LinkedIn, @arudis on Twitter, swaroopjagadish on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Acryl Data
DataHub
Hudi (Podcast Episode)
Iceberg (Podcast Episode)
Delta Lake (Podcast Episode)
Apache Gobblin
Airflow
Superset (Podcast Episode)
Collibra (Podcast Episode)
Alation
Strata Conference Presentation
Acryl/DataHub Ingestion Framework
Joe Hellerstein
Trifacta
DataHub Roadmap
Data Mesh
OpenLineage (Podcast Episode)
OpenMetadata
Egeria Open Metadata

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Support Data Engineering Podcast
Oct 14, 2021 • 1h 2min

How And Why To Become Data Driven As A Business

Summary

Organizations of all sizes are striving to become data driven, starting in earnest with the rise of big data a decade ago. With the never-ending growth in data sources and methods for aggregating and analyzing them, the use of data to direct the business has become a requirement. Randy Bean has been helping enterprise organizations define and execute their data strategies since before the age of big data. In this episode he discusses his experiences and how he approached the work of distilling them for his book "Fail Fast, Learn Faster". This is an entertaining and enlightening exploration of the business side of data with an industry veteran.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today!

Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Your host is Tobias Macey and today I’m interviewing Randy Bean about his recent book focusing on the use of big data and AI for informing data-driven business leadership.

Interview

Introduction
How did you get involved in the area of data management?
Can you start by discussing the focus of the book and what motivated you to write it?
Who is the intended audience, and how did that inform the tone and content?
Businesses and their officers have been aiming to be "data driven" for years. In your experience, what are the concrete goals that are implied by that term?
What are the barriers that organizations encounter in the pursuit of those goals?
How have the success rates (real and imagined) shifted in recent years as the level of sophistication of the tools and industry for data management has increased?
What is the state of data initiatives in leading corporations today?
What are the biggest opportunities and risks that organizations focus on related to their use of data?
At what level(s) of the organization do lessons around data ethics need to be embedded?
You have been working with large companies for many years to help them with their adoption of "big data". How has your work on this book shifted or clarified your perspectives on the subject?
What are the main lessons or ideas that you hope readers will take away from the book?
What are the most interesting, innovative, or unexpected ways that you have seen big data applied to business?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on this book?
What are your predictions for the next decade of big data and AI?

Contact Info

@RandyBeanNVP on Twitter
LinkedIn
Email

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Fail Fast, Learn Faster: Lessons in Data-Driven Leadership in an Age of Disruption, Big Data, and AI (affiliate link)
Book Website
Harvard Business Review
MIT Sloan Review
New Vantage Partners
COBOL
Moneyball
Weapons of Math Destruction
The Seven Roles of the Chief Data Officer

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Support Data Engineering Podcast
Oct 8, 2021 • 44min

Make Your Business Metrics Reusable With Open Source Headless BI Using Metriql

Summary

The key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes. In this episode Burak Kabakcı shares the story behind the project, how you can use it to create your metrics definitions, and the benefits of treating the semantic layer as a dedicated component of your platform.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.

Your host is Tobias Macey and today I’m interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Metriql is and the story behind it?
What are the characteristics and benefits of a "headless BI" system?
What was your motivation to create and open-source Metriql as an independent project outside of your business?
How are you approaching governance and sustainability of the project?
How does Metriql compare to projects such as AirBnB’s Minerva or Transform’s platform?
How does the industry/vertical of a business impact their ability to benefit from a metrics layer/headless BI?
What are the limitations to the logical complexity that can be applied to the calculation of a given metric/set of metrics?
Can you describe how Metriql is implemented?
How have the design and goals of the project changed or evolved since you began working on it?
What are the most complex/difficult engineering elements of building a metrics layer?
Can you describe the workflow of defining metrics?
What have been your guiding principles in defining the user experience for working with Metriql?
What are the opportunities for including business users in the definition of metrics? (e.g. pushing down/generating definitions from a BI layer)
What are the biggest challenges and limitations of creating metrics definitions purely in SQL?
What are the options for exposing metrics back to the warehouse and other operational systems such as reverse ETL vendors?
What are the missing elements in the data ecosystem for taking full advantage of a headless BI/metrics layer?
What are the most interesting, innovative, or unexpected ways that you have seen Metriql used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metriql?
When is Metriql the wrong choice?
What do you have planned for the future of Metriql?

Contact Info

LinkedIn
Website
buremba on GitHub
@bu7emba on Twitter

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Metriql
Rakam
Hazelcast
Headless BI
Google Data Studio
Superset (Podcast Episode, Podcast.__init__ Episode)
Trino (Podcast Episode)
Supergrain
The Missing Piece Of The Modern Data Stack (article by Benn Stancil)
Metabase (Podcast Episode)
dbt (Podcast Episode)
dbt-metabase
re_data
OpenMetadata

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Support Data Engineering Podcast
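As a conceptual illustration of the "define once, consume everywhere" contract of headless BI (this is not Metriql’s configuration format or API, just a toy compiler over a hypothetical metric definition), a single metric definition can be compiled to SQL for whichever dimensions a consuming tool requests:

```python
# Toy headless-BI metric compiler. The definition is stated once; every
# consumer gets consistent SQL, sliced by whatever dimensions it asks for.
import duckdb

METRIC = {  # hypothetical definition, not Metriql's schema
    "name": "total_revenue",
    "model": "orders",
    "measure": "SUM(amount)",
}

def compile_metric(metric, dimensions):
    dims = ", ".join(dimensions)
    return (
        f"SELECT {dims}, {metric['measure']} AS {metric['name']} "
        f"FROM {metric['model']} GROUP BY {dims}"
    )

con = duckdb.connect()
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES
        (DATE '2021-10-01', 'US', 10.0),
        (DATE '2021-10-01', 'DE',  5.0),
        (DATE '2021-10-02', 'US',  7.5)
    ) t(order_date, country, amount)
""")
# The same definition serves a BI tool asking by country and a reverse ETL
# job asking by day and country, with no duplicated business logic.
print(con.execute(compile_metric(METRIC, ["country"])).fetchall())
print(con.execute(compile_metric(METRIC, ["order_date", "country"])).fetchall())
```

Keeping that compilation step in one shared service is what prevents each dashboard, notebook, and sync job from drifting toward its own definition of "revenue".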
Oct 6, 2021 • 46min

Adding Support For Distributed Transactions To The Redpanda Streaming Engine

Summary

Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today!

Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Your host is Tobias Macey and today I’m interviewing Denis Rystsov about implementing transactions in the Redpanda streaming engine.

Interview

Introduction
How did you get involved in the area of data management?
Can you quickly recap what Redpanda is and the goals of the project?
What are the use cases for transactions in a pub/sub messaging system?
What are the elements of streaming systems that make atomic transactions a complex problem?
What was the motivation for starting down the path of adding transactions to the Redpanda engine?
How did the constraint of supporting the Kafka API influence your implementation strategy for transaction semantics?
Can you talk through the details of how you ended up implementing transactions in Redpanda?
What are some of the roadblocks and complexities that you encountered while working through the implementation?
How did you approach the validation and verification of the transactions?
What other features or capabilities are you planning to work on next?
What are the most interesting, innovative, or unexpected ways that you have seen transactions in Redpanda used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on transactions for Redpanda?
When are transactions the wrong choice?
What do you have planned for the future of transaction support in Redpanda?

Contact Info

@rystsov on Twitter
LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Vectorized
Redpanda (Podcast Episode)
Redpanda Transactions Post
Yandex
Cassandra
MongoDB
Riak
Cosmos DB
Jepsen (Podcast Episode)
Testing Shared Memories paper
Journal of Systems Research
Kafka
Pulsar
Seastar Framework
CockroachDB (Podcast Episode)
TiDB
Calvin Paper
Polyjuice Paper
Parallel Commit
Chaos Testing
Matchmaker Paxos Algorithm

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Support Data Engineering Podcast
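Because Redpanda implements the Kafka API, the transactional flow discussed in the episode is driven through standard Kafka clients. A sketch with the confluent-kafka Python client follows; the broker address, topics, and transactional id are placeholders:

```python
# Atomic multi-topic produce via the Kafka transactions API, which Redpanda
# implements. Records become visible to read_committed consumers only if the
# transaction commits.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",  # placeholder Redpanda broker
    "transactional.id": "orders-writer-1",  # stable id used to fence zombies
})
producer.init_transactions()  # register with the coordinator, bump the epoch
producer.begin_transaction()
try:
    producer.produce("orders", key=b"order-1", value=b"created")
    producer.produce("payments", key=b"order-1", value=b"captured")
    producer.commit_transaction()  # both records commit atomically
except Exception:
    producer.abort_transaction()   # neither record is exposed
    raise
```

The transactional.id is what enables the fencing semantics the interview touches on: a restarted producer with the same id invalidates any "zombie" instance still holding an older epoch.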
Oct 2, 2021 • 1h 8min

Building Real-Time Data Platforms For Large Volumes Of Information With Aerospike

Summary

Aerospike is a database engine that is designed to provide millisecond response times for queries across terabytes or petabytes. In this episode Chief Strategy Officer, Lenley Hensarling, explains how the ability to process these large volumes of information in real-time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system. If you need to deal with massive data, at high velocities, in milliseconds, then Aerospike is definitely worth learning about.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold’s proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.

Your host is Tobias Macey and today I’m interviewing Lenley Hensarling about Aerospike and building real-time data platforms.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Aerospike is and the story behind it?
What are the use cases that it is uniquely well suited for?
What are the use cases that you and the Aerospike team are focusing on, and how does that influence your priorities for feature development and user experience?
What are the driving factors for building a real-time data platform?
How is Aerospike being incorporated in application and data architectures?
Can you describe how the Aerospike engine is architected?
How have the design and architecture changed or evolved since it was first created?
How have market forces influenced the product priorities and focus?
What are the challenges that end users face when determining how to model their data given a key/value storage interface?
What are the abstraction layers that you and/or your users build to manage relational or hierarchical data architectures?
What are the operational characteristics of the Aerospike system? (e.g. deployment, scaling, CP vs AP, upgrades, clustering, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen Aerospike used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aerospike?
When is Aerospike the wrong choice?
What do you have planned for the future of Aerospike?

Contact Info

LinkedIn

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links

Aerospike
GitHub
EnterpriseDB
"Nobody Expects The Spanish Inquisition"
ARM CPU Architectures
AWS Graviton Processors
The Datacenter Is The Computer (affiliate link)
Jepsen Tests (Podcast Episode)
Cloud Native Computing Foundation
Prometheus
Grafana
OpenTelemetry (Podcast.__init__ Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Support Data Engineering Podcast
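The key/value modeling questions in the interview are easier to picture with the data model in front of you. A small sketch using the official Aerospike Python client (host, namespace, set, and bin names are placeholders):

```python
# Records in Aerospike are addressed by (namespace, set, user key) and hold
# named "bins", so a relational row flattens into a single keyed record.
import aerospike

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

key = ("test", "users", "user-42")  # (namespace, set, user key)
client.put(key, {
    "name": "Ada",
    "plan": "enterprise",
    "logins": 128,  # bins are individually typed per record
})

_, meta, bins = client.get(key)  # point reads are the millisecond-scale path
print(bins)
client.close()
```

Relational or hierarchical structures get layered on top of this shape by the application, which is exactly the abstraction-layer question posed to the guest above.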
Sep 30, 2021 • 1h 12min

Delivering Your Personal Data Cloud With Prifina

Summary

The promise of online services is that they will make your life easier in exchange for collecting data about you. The reality is that they use more information than you realize for purposes that are not what you intended. There have been many attempts to harness all of the data that you generate for gaining useful insights about yourself, but they are generally difficult to set up and manage or require software development experience. The team at Prifina have built a platform that allows users to create their own personal data cloud and install applications built by developers that power useful experiences while keeping you in full control. In this episode Markus Lampinen shares the goals and vision of the company, the technical aspects of making it a reality, and the future vision for how services can be designed to respect users’ privacy while still providing compelling experiences.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!

Atlan is a collaborative workspace for data-driven teams, like GitHub for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription.

Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today!

Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!

Your host is Tobias Macey and today I’m interviewing Markus Lampinen about Prifina, a platform for building applications powered by personal data that is under the user’s control.

Interview

Introduction
How did you get involved in the area of data management?
Can you describe what Prifina is and the story behind it?
What are the primary goals of Prifina?
There has been a lot of interest in the "quantified self" and different projects (many of them open source) which aim to aggregate all of a user’s data into a single system for analysis and integration. What was lacking in the ecosystem that makes Prifina necessary/valuable?
What are some of the personalized applications for this data that have been most compelling or that users are most interested in?
What are the sources of complexity that you are facing when managing access/privacy of users’ data?
Can you describe the architecture of the platform that you are building?
What are the technological/social/economic underpinnings that are necessary to make a platform like Prifina possible?
What are the assumptions that you had when you first became involved in the project which have been challenged or invalidated as you worked through the implementation and began engaging with users and developers?
How do you approach schema definition/management for developers to have a stable implementation target?
How has that schema evolved as you introduced new data sources?
What are the barriers that you and your users have to deal with when obtaining copies of their data for use with Prifina?
What are the potential threats that you anticipate for users gaining and maintaining control of their own data?
What are the untapped opportunities?
What are the topics where you have had to invest the most in user education?
What are the most interesting, innovative, or unexpected ways that you have seen Prifina used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Prifina?
When is Prifina the wrong choice?
What do you have planned for the future of Prifina?

Contact Info

LinkedIn
@mmlampinen on Twitter
mmlampinen on GitHub

Parting Question

From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements

Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat.

Links

Prifina
Google Takeout

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA.

Support Data Engineering Podcast
