Data Engineering Podcast

Tobias Macey
Oct 22, 2019 • 43min

Data Orchestration For Hybrid Cloud Analytics

Summary
The scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity. A short code sketch illustrating that decoupling follows these notes.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time invested in data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Dipti Borkar about data orchestration and how it helps in migrating data workloads to the cloud.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you mean by the term "Data Orchestration"?
How does it compare to the concept of "Data Virtualization"?
What are some of the tools and platforms that fit under that umbrella?
What are some of the motivations for organizations to use the cloud for their data oriented workloads?
What are they giving up by using cloud resources in place of on-premises compute?
For businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments?
What are some of the common patterns for cloud migration projects and what challenges do they present?
Do you have advice on useful metrics to track for determining project completion or success criteria?
How do businesses approach employee education for designing and implementing effective systems for achieving their migration goals?
Can you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?
What are some of the common pain points that organizations encounter when working on hybrid implementations?
What are some of the missing pieces in the data orchestration landscape? Are there any efforts that you are aware of that are aiming to fill those gaps?
Where is the data orchestration market heading, and what are some industry trends that are driving it?
What projects are you most interested in or excited by?
For someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?

Contact Info
LinkedIn
@dborkar on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Alluxio (Podcast Episode), UC San Diego, Couchbase, Presto (Podcast Episode), Spark SQL, Data Orchestration, Data Virtualization, PyTorch (Podcast.__init__ Episode), Rook storage orchestration, PySpark, MinIO (Podcast Episode), Kubernetes, OpenStack, Hadoop, HDFS, Parquet Files (Podcast Episode), ORC Files, Hive Metastore, Iceberg Table Format (Podcast Episode), Data Orchestration Summit, Star Schema, Snowflake Schema, Data Warehouse, Data Lake, Teradata

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
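
The storage/compute decoupling discussed in this episode can be made concrete with a small PySpark sketch. This is a minimal illustration, not a reference deployment: it assumes a running Alluxio cluster reachable at alluxio-master:19998, the Alluxio client jar on Spark’s classpath, and a hypothetical dataset path.

```python
from pyspark.sql import SparkSession

# Minimal sketch: Spark reads through Alluxio's filesystem namespace
# instead of addressing the underlying store (S3, HDFS, ...) directly.
# The hostname, port, and dataset path below are illustrative assumptions.
spark = SparkSession.builder.appName("alluxio-read-sketch").getOrCreate()

# The same alluxio:// URI works whether the bytes live on-premises or in
# the cloud, which is what lets compute move without rewriting jobs.
events = spark.read.parquet("alluxio://alluxio-master:19998/warehouse/events")

events.groupBy("event_type").count().show()
```

Because the compute engine only sees the Alluxio namespace, migrating the underlying data from an on-premises HDFS cluster to cloud object storage becomes a remount rather than a rewrite of every job.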
Oct 15, 2019 • 47min

Keeping Your Data Warehouse In Order With DataForm

Summary
Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control. A small example of a Dataform-style SQL definition follows these notes.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.
Are you working on data, analytics, or AI using platforms such as Presto, Spark, or Tensorflow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View. This one day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and Tensorflow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off of your ticket, and the first five people to register get free tickets! Register now as early bird tickets are ending this week! Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Lewis Hemens about Dataform, a platform that helps analysts manage all data processes in your cloud data warehouse.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Dataform is and the origin story for the platform and company?
What are the main benefits of using a tool like Dataform and who are the primary users?
Can you talk through the workflow for someone using Dataform and highlight the main features that it provides?
What are some of the challenges and mistakes that are common among engineers and analysts with regard to versioning and evolving schemas and the accompanying data?
How does CI/CD and change management manifest in the context of data warehouse management?
How is the Dataform SDK itself implemented and how has it evolved since you first began working on it?
Can you differentiate the capabilities between the open source CLI and the hosted web platform, and when you might need to use one over the other?
What was your selection process for an embedded runtime and how did you decide on JavaScript?
Can you talk through some of the use cases that having an embedded runtime enables?
What are the limitations of SQL when working in a collaborative environment?
Which database engines do you support and how do you reduce the maintenance burden for supporting different dialects and capabilities?
What is involved in adding support for a new backend?
When is Dataform the wrong choice?
What do you have planned for the future of Dataform?

Contact Info
LinkedIn
@lewishemens on Twitter
lewish on GitHub

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Dataform, YCombinator, DBT == Data Build Tool (Podcast Episode), Fishtown Analytics, TypeScript, Continuous Integration, Continuous Delivery, BigQuery, Snowflake DB, UDF == User Defined Function, RedShift, PostgreSQL (Podcast Episode), AWS Athena, Presto (Podcast Episode), Apache Beam, Apache Kafka, Segment (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
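
To give a flavor of the workflow discussed here, below is a small Dataform-style SQLX definition. It is a hedged sketch: the table, schema, and column names are invented, and the exact configuration options should be checked against the Dataform documentation.

```sqlx
config {
  type: "table",
  schema: "analytics",
  description: "Daily active users derived from raw events",
  assertions: {
    nonNull: ["day", "active_users"]
  }
}

SELECT
  DATE(created_at) AS day,
  COUNT(DISTINCT user_id) AS active_users
FROM ${ref("events")} -- ref() declares the dependency between definitions
GROUP BY 1
```

The ref() call is what turns a pile of SQL scripts into a dependency graph that can be documented, tested, and executed repeatably.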
Oct 8, 2019 • 55min

Fast Analytics On Semi-Structured And Structured Data In The Cloud

Summary
The process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure. A hedged sketch of querying semi-structured data through a SQL-over-HTTP API follows these notes.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless, data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time invested in data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Shruti Bhat and Venkat Venkataramani about Rockset, a serverless platform for enabling fast SQL queries across all of your data.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Rockset is and your motivation for creating it?
What are some of the use cases that it enables which would otherwise be impractical or intractable?
How does Rockset fit into the infrastructure and workflow of data teams and what portions of a typical stack does it replace?
Can you describe how the Rockset platform is architected and how it has evolved as you onboard more customers?
Can you describe the flow of a piece of data as it traverses the full lifecycle in Rockset?
How is your storage backend implemented to allow for speed and flexibility in the query layer?
How does it manage distribution, balancing, and durability of the data?
What are your strategies for handling node and region failure in the cloud?
You have a whitepaper describing your architecture as being oriented around microservices on Kubernetes in order to be cloud agnostic. How do you handle the case where customers have data sources that span multiple cloud providers or regions and the latency that can result?
How is the query engine structured to allow for optimizing so many different query types (e.g. search, graph, timeseries, etc.)?
With Rockset handling a large portion of the underlying infrastructure work that a data engineer might be involved with, what are some ways that you have seen them use the time that they have gained and how has that benefitted the organizations that they work for?
What are some of the most interesting/unexpected/innovative ways that you have seen Rockset used?
When is Rockset the wrong choice for a given project?
What have you found to be the most challenging and the most exciting aspects of building the Rockset platform and company?
What do you have planned for the future of Rockset?

Contact Info
Venkat: LinkedIn, @iamveeve on Twitter, veeve on GitHub
Shruti: LinkedIn, @shrutibhat on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Rockset (Blog), Oracle, VMWare, Facebook, Rube Goldberg Machine, SnowflakeDB, Protocol Buffers, Spark (Podcast Episode), Presto (Podcast Episode), Apache Kafka, RocksDB, InnoDB, Lucene, Log-Structured Merge Tree (LSM Tree), Kubernetes

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
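
As an illustration of running SQL over raw, semi-structured documents, here is a hedged Python sketch against Rockset’s HTTP query API. The endpoint URL, auth header format, and collection name are assumptions based on the public documentation of this era; verify them against the current API reference before relying on them.

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder; issued from the Rockset console
# Assumed query endpoint; the region and path may differ for your account.
QUERY_URL = "https://api.rs2.usw2.rockset.com/v1/orgs/self/queries"

# Dotted paths reach into nested JSON fields without a predefined schema,
# which is the property that makes SQL over raw documents practical here.
sql = """
SELECT e.payload.user.id AS user_id, COUNT(*) AS clicks
FROM commons.click_events e
WHERE e.payload.action = 'click'
GROUP BY e.payload.user.id
ORDER BY clicks DESC
LIMIT 10
"""

resp = requests.post(
    QUERY_URL,
    headers={"Authorization": f"ApiKey {API_KEY}"},
    json={"sql": {"query": sql}},
)
resp.raise_for_status()
for row in resp.json().get("results", []):
    print(row)
```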
Oct 1, 2019 • 35min

Ship Faster With An Opinionated Data Pipeline Framework

Summary
Building an end-to-end data pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that you can structure it. Kedro is a framework that provides an opinionated workflow that lets you focus on the parts that matter, so that you don’t waste time on gluing the steps together. In this episode Tom Goldenberg explains how it works, how it is being used at QuantumBlack for customer projects, and how it can help you structure your own. Definitely worth a listen to gain more understanding of the benefits that a standardized process can provide. A minimal sketch of Kedro’s pipeline API follows these notes.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, Data Council in Barcelona, and the Data Orchestration Summit. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tom Goldenberg about Kedro, an open source development workflow tool that helps structure reproducible, scalable, deployable, robust and versioned data pipelines.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Kedro is and its origin story?
Who are the primary users of Kedro, and how does it fit into and impact the workflow of data engineers and data scientists?
Can you talk through a typical lifecycle for a project that is built using Kedro?
What are the overall features of Kedro and how do they compound to encourage best practices for data projects?
How does the culture and background of QuantumBlack influence the design and capabilities of Kedro?
What was the motivation for releasing it publicly as an open source framework?
What are some examples of ways that Kedro is being used within QuantumBlack and how has that experience informed the design and direction of the project?
Can you describe how Kedro itself is implemented and how it has evolved since you first started working on it?
There has been a recent trend away from end-to-end ETL frameworks and toward a decoupled model that focuses on a programming target with pluggable execution. What are the industry pressures that are driving that shift and what are your thoughts on how that will manifest in the long term?
How do the capabilities and focus of Kedro compare to similar projects such as Prefect and Dagster?
Kedro has not yet reached a stable release. What are the aspects of the project that are still in flux and where are the changes most concentrated?
What is still missing for a stable 1.x release?
What are some of the most interesting/innovative/unexpected ways that you have seen Kedro used?
When is Kedro the wrong choice?
What do you have in store for the future of Kedro?

Contact Info
LinkedIn
@tomgoldenberg on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Kedro (GitHub), Quantum Black Labs (GitHub), Agolo, McKinsey, Airflow, Docker, Kubernetes, DataBricks, Formula 1, Kedro Viz, Dask (Podcast Interview), Py.Test, Azure Data Factory, Prefect (Podcast Interview), Dagster

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
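
For a sense of the opinionated structure Kedro imposes, here is a minimal sketch of its node-and-pipeline model. The functions and dataset names are invented; in a real project the named datasets would be declared in Kedro’s data catalog.

```python
import pandas as pd
from kedro.pipeline import Pipeline, node


def clean_orders(raw_orders: pd.DataFrame) -> pd.DataFrame:
    # Placeholder transformation: drop rows missing a customer id.
    return raw_orders.dropna(subset=["customer_id"])


def summarize_orders(clean_orders: pd.DataFrame) -> pd.DataFrame:
    return clean_orders.groupby("customer_id", as_index=False)["amount"].sum()


# Nodes declare inputs and outputs by name; Kedro resolves the names
# against the data catalog and derives execution order from the graph,
# so the glue between steps never has to be written by hand.
pipeline = Pipeline(
    [
        node(clean_orders, inputs="raw_orders", outputs="clean_orders"),
        node(summarize_orders, inputs="clean_orders", outputs="order_summary"),
    ]
)
```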
Sep 23, 2019 • 1h 8min

Open Source Object Storage For All Of Your Data

Summary
Object storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the de facto API for interacting with this service, so the team at MinIO have built a production grade, easy to manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project. A short example of using MinIO through a standard S3 client follows these notes.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system.

Interview
Introduction
How did you get involved in the area of data management?
Can you explain what MinIO is and its origin story?
What are some of the main use cases that MinIO enables?
How does MinIO compare to other object storage options and what benefits does it provide over other open source platforms?
Your marketing focuses on the utility of MinIO for ML and AI workloads. What benefits does object storage provide as compared to distributed file systems (e.g. HDFS, GlusterFS, Ceph)?
What are some of the challenges that you face in terms of maintaining compatibility with the S3 interface?
What are the constraints and opportunities that are provided by adhering to that API?
Can you describe how MinIO is implemented and the overall system design?
How has that design evolved since you first began working on it?
What assumptions did you have at the outset and how have they been challenged or updated?
What are the axes for scaling that MinIO provides and how does it handle clustering?
Where does it fall on the axes of availability and consistency in the CAP theorem?
One of the useful features that you provide is efficient erasure coding, as well as protection against data corruption. How much overhead do those capabilities incur, in terms of computational efficiency and, in a clustered scenario, storage volume?
For someone who is interested in running MinIO, what is involved in deploying and maintaining an installation of it?
What are the cases where it makes sense to use MinIO in place of a cloud-native object store such as S3 or Google Cloud Storage?
How do you approach project governance and sustainability?
What are some of the most interesting/innovative/unexpected ways that you have seen MinIO used?
What do you have planned for the future of MinIO?

Contact Info
LinkedIn
@abperiasamy on Twitter
abperiasamy on GitHub

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
MinIO, GlusterFS, Object Storage, RedHat, Bionics, AWS S3, Ceph, SwiftStack, POSIX, HDFS, Google BigQuery, AzureML, AWS SageMaker, AWS Athena, S3 Select, Azure Blob Store, BackBlaze, Round Robin DNS, Service Mesh, Istio, Envoy, SmartStack, Free Software, RocksDB, TanTan Blog Post, Presto, SparkML, MC Admin Trace, DTrace

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
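
Since MinIO replicates the S3 API, any standard S3 client can talk to it by overriding the endpoint. The sketch below uses boto3 against a local test server; the endpoint and credentials are throwaway local-deployment placeholders, not production values.

```python
import boto3

# Point a stock S3 client at a local MinIO server. The endpoint and
# credentials here are local-test placeholders -- adjust for a real cluster.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="minioadmin",
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="training-data")
s3.upload_file("features.parquet", "training-data", "features.parquet")

response = s3.list_objects_v2(Bucket="training-data")
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
```

The same code runs unmodified against AWS S3 by dropping the endpoint_url override, which is exactly the compatibility property discussed in the episode.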
Sep 18, 2019 • 58min

Navigating Boundless Data Streams With The Swim Kernel

Summary
The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye-opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows. A toy sketch of that stateful per-entity pattern follows these notes.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Swim.ai is and how the project and business got started?
Can you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?
What are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable?
How does Swim help alleviate the challenges of working with sensor oriented applications or edge computing platforms?
Can you describe a typical design for an application or system being built on top of the Swim platform?
What does the developer workflow look like?
What kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?
Can you describe the internal design for the SwimOS and how it has evolved since you first began working on it?
For such widely distributed applications, efficient discovery and communication is essential. How does Swim handle that functionality?
What mechanisms are in place to account for network failures?
Since the application nodes are explicitly stateful, how do you handle scaling as compared to a stateless web application?
Since there is no explicit data layer, how is data redundancy handled by Swim applications?
What are some of the most interesting/unexpected/innovative ways that you have seen the Swim technology used?
What have you found to be the most challenging aspects of building the Swim platform?
What are some of the assumptions that you had going into the creation of SwimOS and how have they been challenged or updated?
What do you have planned for the future of the technical and business aspects of Swim.ai?

Contact Info
LinkedIn
Wikipedia
@simoncrosby on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Swim.ai, Hadoop, Streaming Data, Apache Flink (Podcast Episode), Apache Kafka, Wallaroo (Podcast Episode), Digital Twin, Swim Concepts Documentation, RFID == Radio Frequency IDentification, PCB == Printed Circuit Board, GraalVM, Azure IoT Edge Framework, Azure DLS (Data Lake Storage), Power BI, WARP Protocol, LightBend

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
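
Swim applications are written in Java against SwimOS’s own APIs; the toy Python sketch below is not the Swim API, just an illustration of the stateful per-entity ("digital twin") pattern the episode contrasts with batch pipelines: each entity holds its own in-memory state and updates it incrementally as events arrive, so a current answer is always available without a separate collect-then-analyze step.

```python
class IntersectionTwin:
    """Toy stand-in for a stateful agent: one instance per intersection."""

    def __init__(self, intersection_id: str):
        self.intersection_id = intersection_id
        self.vehicle_count = 0
        self.current_phase = None

    def on_event(self, event: dict) -> None:
        # State mutates in place per event; no batch reassembly needed.
        if event["kind"] == "vehicle_detected":
            self.vehicle_count += 1
        elif event["kind"] == "phase_change":
            self.current_phase = event["phase"]

    def is_congested(self, threshold: int = 50) -> bool:
        return self.vehicle_count > threshold


twins: dict = {}

def route(event: dict) -> None:
    # Each event is routed to the twin that models its source entity.
    key = event["intersection_id"]
    twin = twins.setdefault(key, IntersectionTwin(key))
    twin.on_event(event)

route({"intersection_id": "5th-and-main", "kind": "vehicle_detected"})
print(twins["5th-and-main"].vehicle_count)  # 1
```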
Sep 10, 2019 • 55min

Building A Reliable And Performant Router For Observability Data

Summary
The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better. A minimal configuration sketch follows these notes.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Ben Johnson and Luke Steensen about Vector, a high-performance, open-source observability data router.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what the Vector project is and your reason for creating it?
What are some of the comparable tools that are available and what were they lacking that prompted you to start a new project?
What strategy are you using for project governance and sustainability?
What are the main use cases that Vector enables?
Can you explain how Vector is implemented and how the system design has evolved since you began working on it?
How did your experience building the business and products for Timber influence and inform your work on Vector?
When you were planning the implementation, what were your criteria for the runtime implementation and why did you decide to use Rust?
What led you to choose Lua as the embedded scripting environment?
What data format does Vector use internally?
Is there any support for defining and enforcing schemas?
In the event of a malformed message is there any capacity for a dead letter queue?
What are some strategies for formatting source data to improve the effectiveness of the information that is gathered and the ability of Vector to parse it into useful data?
When designing an event flow in Vector what are the available mechanisms for testing the overall delivery and any transformations?
What options are available to operators to support visibility into the running system?
In terms of deployment topologies, what capabilities does Vector have to support high availability and/or data redundancy?
What are some of the other considerations that operators and administrators of Vector should be considering?
You have a fairly well defined roadmap for the different point versions of Vector. How did you determine what the priority ordering was and how quickly are you progressing on your roadmap?
What is the available interface for adding and extending the capabilities of Vector? (source/transform/sink)
What are some of the most interesting/innovative/unexpected ways that you have seen Vector used?
What are some of the challenges that you have faced in building/publicizing Vector?
For someone who is interested in using Vector, how would you characterize the overall maturity of the project currently?
What is missing that you would consider necessary for production readiness?
When is Vector the wrong choice?

Contact Info
Ben: @binarylogic on Twitter, binarylogic on GitHub
Luke: LinkedIn, @lukesteensen on Twitter, lukesteensen on GitHub

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Vector (GitHub), Timber.io, Observability, SeatGeek, Apache Kafka, StatsD, FluentD, Splunk, Filebeat, Logstash, Fluent Bit, Rust, Tokio (Rust library), TOML, Lua, Nginx, HAProxy, WebAssembly (WASM), Protocol Buffers, Jepsen

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
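
Vector topologies are declared in a TOML file of sources, transforms, and sinks. Below is a minimal hypothetical configuration in the style of the 0.x releases from this period; the component names are arbitrary and the exact options should be checked against the current Vector reference, since the configuration schema has evolved.

```toml
# Tail an application log, parse each line, and print structured events.
[sources.app_logs]
type = "file"
include = ["/var/log/app/*.log"]

[transforms.parsed]
type = "regex_parser"       # transform type from the 0.x era
inputs = ["app_logs"]       # upstream components are referenced by name
regex = "^(?P<timestamp>\\S+) (?P<level>\\S+) (?P<message>.*)$"

[sinks.console_out]
type = "console"
inputs = ["parsed"]
encoding = "json"
```

Swapping the console sink for, say, a Kafka or S3 sink changes only the final block, which is what makes Vector usable as a single router for both logs and metrics.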
Sep 2, 2019 • 53min

Building A Community For Data Professionals at Data Council

Summary
Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running the Data Council series of conferences and meetups around the world. In this episode Pete discusses his motivation for starting these events, how they serve to bring the data community together, and the observations that he has made about the direction that we are moving. He also shares his experiences as an investor in developer oriented startups and his views on the importance of empowering engineers to launch their own companies.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies.

Interview
Introduction
How did you get involved in the area of data management?
What was your original reason for focusing your efforts on fostering a community of data engineers?
What was the state of recognition in the industry for that role at the time that you began your efforts?
The current manifestation of your community efforts is in the form of the Data Council conferences and meetups. Previously they were known as Data Eng Conf and before that was Hakka Labs. Can you discuss the evolution of your efforts to grow this community?
How has the community itself changed and grown over the past few years?
Communities form around a huge variety of focal points. What are some of the complexities or challenges in building one based on something as nebulous as data?
Where do you draw inspiration and direction for how to manage such a large and distributed community?
What are some of the most interesting/challenging/unexpected aspects of community management that you have encountered?
What are some ways that you have been surprised or delighted in your interactions with the data community?
How do you approach sustainability of the Data Council community and the organization itself?
The tagline that you have focused on for Data Council events is that they are no fluff, juxtaposing them against larger business oriented events. What are your guidelines for fulfilling that promise and why do you think that is an important distinction?
In addition to your community building you are also an investor. How did you get involved in that side of your business and how does it fit into your overall mission?
You also have a stated mission to help engineers build their own companies. In your opinion, how does an engineer led business differ from one that may be founded or run by a business oriented individual and why do you think that we need more of them?
What are the ways that you typically work to empower engineering founders or encourage them to create their own businesses?
What are some of the challenges that engineering founders face and what are some common difficulties or misunderstandings related to business?
What are your opinions on venture-backed vs. "lifestyle" or bootstrapped businesses?
What are the characteristics of a data business that you look at when evaluating a potential investment?
What are some of the current industry trends that you are most excited by?
What are some that you find concerning?
What are your goals and plans for the future of Data Council?

Contact Info
@petesoder on Twitter
LinkedIn
@petesoder on Medium

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Data Council, Database Design For Mere Mortals, Bloomberg, Garmin, 500 Startups, Geeks On A Plane, Data Council NYC 2019 Track Summary, Pete’s Angel List Syndicate, DataOps, Data Kitchen (Episode), DataOps Vs DevOps (Episode), Great Expectations (Podcast.__init__ Interview), Elementl, Dagster (Data Council Presentation), Data Council Call For Proposals

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
Aug 26, 2019 • 48min

Building Tools And Platforms For Data Analytics

Summary
Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has their own set of requirements for the way that they access and interact with those platforms depending on the insights they are trying to gather. Benn Stancil is the chief analyst at Mode Analytics and in this episode he explains the set of considerations and requirements that data analysts need in their tools and platforms. He also explains useful patterns for collaboration between data engineers and data analysts, and what they can learn from each other.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Benn Stancil, chief analyst at Mode Analytics, about what data engineers need to know when building tools for analysts.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing some of the main features that you are looking for in the tools that you use?
What are some of the common shortcomings that you have found in out-of-the-box tools that organizations use to build their data stack?
What should data engineers be considering as they design and implement the foundational data platforms that higher order systems are built on, which are ultimately used by analysts and data scientists?
In terms of mindset, what are the ways that data engineers and analysts can align and where are the points of conflict?
In terms of team and organizational structure, what have you found to be useful patterns for reducing friction in the product lifecycle for data tools (internal or external)?
What are some anti-patterns that data engineers can guard against as they are designing their pipelines?
In your experience as an analyst, what have been the characteristics of the most seamless projects that you have been involved with?
How much understanding of analytics is necessary for data engineers to be successful in their projects and careers?
Conversely, how much understanding of data management should analysts have?
What are the industry trends that you are most excited by as an analyst?

Contact Info
LinkedIn
@bennstancil on Twitter
Website

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat

Links
Mode Analytics (Data Council Presentation), Yammer, StitchFix (Blog Post), SnowflakeDB, Re:Dash, Superset, Marquez, Amundsen (Podcast Episode), Elementl, Dagster (Data Council Presentation), DBT (Podcast Episode), Great Expectations (Podcast.__init__ Episode), Delta Lake (Podcast Episode), Stitch, Fivetran (Podcast Episode)

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
Aug 19, 2019 • 1h 14min

A High Performance Platform For The Full Big Data Lifecycle

Summary
Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful. A tiny ECL sketch follows these notes.

Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions.

Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?
What was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?
Can you describe the high level architecture of the HPCC Systems platform and some of the ways that the design has changed over the years that it has been maintained?
Given how long the project has been in use, can you talk about some of the ways that it has had to evolve to accommodate changing trends in usage and technologies for big data and advanced analytics?
For someone who is using HPCC Systems, can you talk through a common workflow and the ways that the data traverses the various components?
How does HPCC Systems manage persistence and scalability?
What are the integration points available for extending and enhancing the HPCC Systems platform?
What is involved in deploying and managing a production installation of HPCC Systems?
The ECL language is an intriguing element of the overall system. What are some of the features that it provides which simplify processing and management of data?
How does the Thor engine manage data transformation and manipulation?
What are some of the unique features of Thor and how does it compare to other approaches for ETL and data integration?
For extraction and analysis of data can you talk through the capabilities of the Roxie engine?
How are you using the HPCC Systems platform in your work at LexisNexis?
Despite being older than the Hadoop platform it doesn’t seem that HPCC Systems has seen the same level of growth and popularity. Can you share your perspective on the community for HPCC Systems and how it compares to that of Hadoop over the past decade?
How is the HPCC Systems project governed, and what is your approach to sustainability?
What are some of the additional capabilities that are only available in the enterprise distribution?
When is the HPCC Systems platform the wrong choice, and what are some systems that you might use instead?
What have been some of the most interesting/unexpected/novel ways that you have seen HPCC Systems used?
What are some of the challenges that you have faced and lessons that you have learned while building and maintaining the HPCC Systems platform and community?
What do you have planned for the future of HPCC Systems?

Contact Info
LinkedIn
@fvillanustre on Twitter

Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?

Links
HPCC Systems, LexisNexis Risk Solutions, Risk Management, Hadoop, MapReduce, Sybase, Oracle DB, AbInitio, Data Lake, SQL, ECL, DataFlow, TensorFlow, ECL IDE

The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Support Data Engineering Podcast
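
ECL, HPCC’s declarative dataflow language, is worth seeing in miniature. The snippet below is a hedged sketch (the logical file name and record layout are invented): definitions describe datasets and derivations, and nothing executes until an action such as OUTPUT demands a result.

```ecl
// Hypothetical record layout and logical file name.
PersonRec := RECORD
    UNSIGNED4 id;
    STRING25  name;
    UNSIGNED2 age;
END;

people := DATASET('~tutorial::persons', PersonRec, THOR);

// A definition, not a statement: adults is only computed when used.
adults := people(age >= 18);

OUTPUT(adults, NAMED('Adults'));
```

In the HPCC architecture discussed in the episode, the Thor engine parallelizes dataflows like this across the cluster for transformation work, while Roxie serves indexed results for low-latency queries.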
