

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes of the tools, techniques, and difficulties that come with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Sep 1, 2020 • 1h 6min
Building A Better Data Warehouse For The Cloud At Firebolt
Summary
Data warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in data warehousing are oriented around cloud native architectures that take advantage of dynamic scaling and the separation of compute and storage. Firebolt is taking that a step further with a core focus on speed and interactivity. In this episode CEO and founder Eldad Farkash explains how the Firebolt platform is architected for high throughput, their simple and transparent pricing model to encourage widespread use, and the use cases that it unlocks through interactive query speeds.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Eldad Farkash about Firebolt, a cloud data warehouse optimized for speed and elasticity on structured and semi-structured data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Firebolt is and your motivation for building it?
How does Firebolt compare to other data warehouse technologies, and what unique features does it provide?
The lines between a data warehouse and a data lake have been blurring in recent years. Where on that continuum does Firebolt lie?
What are the unique use cases that Firebolt allows for?
How do the performance characteristics of Firebolt change the ways that an engineer should think about data modeling?
What technologies might someone replace with Firebolt?
How is Firebolt architected and how has the design evolved since you first began working on it?
What are some of the most challenging aspects of building a data warehouse platform that is optimized for speed?
How do you handle support for nested and semi-structured data?
In what ways have you found it necessary/useful to extend SQL?
Because object storage is immutable, updating or deleting data in a data lake involves reprocessing a potentially large amount of data. How do you approach that in Firebolt with your F3 format?
What have you found to be the most interesting, unexpected, or challenging lessons while building and scaling the Firebolt platform and business?
When is Firebolt the wrong choice?
What do you have planned for the future of Firebolt?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Firebolt
Sisense
SnowflakeDB
Podcast Episode
Redshift
Spark
Podcast Episode
Parquet
Podcast Episode
Hadoop
HDFS
S3
AWS Athena
BigQuery
Data Vault
Podcast Episode
Star Schema
Dimensional Modeling
Slowly Changing Dimensions
JDBC
TPC Benchmarks
DBT
Podcast Episode
Tableau
Looker
Podcast Episode
PrestoSQL
Podcast Episode
PostgreSQL
Podcast Episode
FoundationDB
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 25, 2020 • 51min
Metadata Management And Integration At LinkedIn With DataHub
Summary
In order to scale the use of data across an organization there are a number of challenges related to discovery, governance, and integration that need to be solved. The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. In this episode Mars Lan and Pardhu Gunnam explain how they designed the platform, how it integrates into their data platforms, and how it is being used to power data discovery and analytics at LinkedIn.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about DataHub, LinkedIn’s metadata management and data catalog platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what DataHub is and some of its back story?
What were you using at LinkedIn for metadata management prior to the introduction of DataHub?
What was lacking in the previous solutions that motivated you to create a new platform?
There are a large number of other systems available for building data catalogs and tracking metadata, both open source and proprietary. What are the features of DataHub that would lead someone to use it in place of the other options?
Who is the target audience for DataHub?
How do the needs of those end users influence or constrain your approach to the design and interfaces provided by DataHub?
Can you describe how DataHub is architected?
How has it evolved since you first began working on it?
What was your motivation for releasing DataHub as an open source project?
What have been the benefits of that decision?
What are the challenges that you face in maintaining changes between the public repository and your internally deployed instance?
What is the workflow for populating metadata into DataHub?
What are the challenges that you see in managing the format of metadata and establishing consistent models for the information being stored?
How do you handle discovery of data assets for users of DataHub?
What are the integration and extension points of the platform?
What is involved in deploying and maintaining an instance of the DataHub platform?
What are some of the most interesting or unexpected ways that you have seen DataHub used inside or outside of LinkedIn?
What are some of the most interesting, unexpected, or challenging lessons that you learned while building and working with DataHub?
When is DataHub the wrong choice?
What do you have planned for the future of the project?
Contact Info
Mars
LinkedIn
mars-lan on GitHub
Pardhu
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataHub
Map/Reduce
Apache Flume
LinkedIn Blog Post introducing DataHub
WhereHows
Hive Metastore
Kafka
CDC == Change Data Capture
Podcast Episode
PDL (LinkedIn’s schema definition language)
GraphQL
Elasticsearch
Neo4J
Apache Pinot
Apache Gobblin
Apache Samza
Open Sourcing DataHub Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 17, 2020 • 1h 6min
Exploring The TileDB Universal Data Engine
Summary
Most databases are designed to work with textual data, with some special purpose engines that support domain specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB, Stavros Papadopoulos, shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shift in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB Cloud to embed authorization into the storage engine while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture and how it enables a more flexible way to store and interact with data, powering better data sharing and new opportunities for blending specialized domains.
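To make the array-first model concrete, here is a rough sketch of defining, writing, and slicing a dense array, following the pattern of TileDB’s public Python quickstart (the array name and values are invented for illustration):

```python
import numpy as np
import tiledb  # pip install tiledb

# Define a 4x4 dense integer array with a single attribute "a".
dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(1, 4), tile=2, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(1, 4), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=False, attrs=[tiledb.Attr(name="a", dtype=np.int32)]
)
tiledb.DenseArray.create("quickstart_dense", schema)

# Write a NumPy array into it, then slice a sub-region back out.
with tiledb.DenseArray("quickstart_dense", mode="w") as A:
    A[:] = np.arange(1, 17, dtype=np.int32).reshape(4, 4)

with tiledb.DenseArray("quickstart_dense", mode="r") as A:
    print(A[1:3, 2:4]["a"])  # reads return a dict keyed by attribute name
```

Sparse arrays follow the same schema-first pattern with sparse=True, which is where much of the engineering effort discussed in the interview lies.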
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Stavros Papadopoulos about TileDB, the universal storage engine
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what TileDB is and the problem that you are trying to solve with it?
What was your motivation for building it?
What are the main use cases or problem domains that you are trying to solve for?
What are the shortcomings of existing approaches to database design that prevent them from being useful for these applications?
What are the benefits of using matrices for data processing and domain modeling?
What are the challenges that you have faced in storing and processing sparse matrices efficiently?
How does the usage of matrices as the foundational primitive affect the way that users should think about data modeling?
What are the benefits of unbundling the storage engine from the processing layer?
Can you describe how TileDB embedded is architected?
How has the design evolved since you first began working on it?
What is your approach to integrating with the broader ecosystem of data storage and processing utilities?
What does the workflow look like for someone using TileDB?
What is required to deploy TileDB in a production context?
How is the built in data versioning implemented?
What is the user experience for interacting with different versions of datasets?
How do you manage the lifecycle of versioned data to allow garbage collection?
How are you managing the governance and ongoing sustainability of the open source project, and the commercial offerings that you are building on top of it?
What are the most interesting, unexpected, or innovative ways that you have seen TileDB used?
What have you found to be the most interesting, unexpected, or challenging aspects of building TileDB?
What features or capabilities are you consciously deciding not to implement?
When is TileDB the wrong choice?
What do you have planned for the future of TileDB?
Contact Info
LinkedIn
stavrospapadopoulos on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
TileDB
GitHub
Data Frames
TileDB Cloud
MIT
Intel
Sparse Linear Algebra
Sparse Matrices
HDF5
Dask
Spark
MariaDB
PrestoDB
GDAL
PDAL
Turing Complete
Clustered Index
Parquet File Format
Podcast Episode
Serializability
Delta Lake
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 10, 2020 • 59min
Closing The Loop On Event Data Collection With Iteratively
Summary
Event-based data is a rich source of information for analytics, but only if the event structures are consistent. The team at Iteratively are building a platform to manage the end-to-end flow of collaboration around which events are needed, how to structure their attributes, and how they are captured. In this episode founders Patrick Thompson and Ondrej Hrebicek discuss the problems that they have experienced as a result of inconsistent event schemas, how the Iteratively platform integrates the definition, development, and delivery of event data, and the benefits of elevating the visibility of event data for improving the effectiveness of the resulting analytics. If you are struggling with inconsistent implementations of event data collection, or a lack of clarity on which attributes are needed and how they are being used, then this is definitely a conversation worth following.
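The core idea, enforcing a shared schema on every tracked event, can be illustrated with plain JSON Schema. A minimal sketch in Python using the jsonschema library (the event name and attributes are hypothetical, not Iteratively’s actual format):

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

# Hypothetical schema for a "Song Played" analytics event.
SONG_PLAYED = {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["event", "properties"],
    "properties": {
        "event": {"const": "Song Played"},
        "properties": {
            "type": "object",
            "required": ["song_id"],
            "properties": {
                "song_id": {"type": "string"},
                "duration_seconds": {"type": "number", "minimum": 0},
            },
            "additionalProperties": False,  # reject attributes nobody agreed on
        },
    },
}

event = {"event": "Song Played",
         "properties": {"song_id": "abc123", "duration_seconds": 211}}
try:
    validate(instance=event, schema=SONG_PLAYED)
except ValidationError as err:
    print(f"Rejected event: {err.message}")
```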
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Patrick Thompson and Ondrej Hrebicek about Iteratively, a platform for enforcing consistent schemas for your event data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Iteratively and your motivation for creating it?
What are some of the ways that you have seen inconsistent message structures cause problems?
What are some of the common anti-patterns that you have seen for managing the structure of event messages?
What are the benefits that Iteratively provides for the different roles in an organization?
Can you describe the workflow for a team using Iteratively?
How is the Iteratively platform architected?
How has the design changed or evolved since you first began working on it?
What are the difficulties that you have faced in building integrations for the Iteratively workflow?
How is schema evolution handled throughout the lifecycle of an event?
What are the challenges that engineers face in building effective integration tests for their event schemas?
What has been your biggest challenge in messaging for your platform and educating potential users of its benefits?
What are some of the most interesting or unexpected ways that you have seen Iteratively used?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Iteratively?
When is Iteratively the wrong choice?
What do you have planned for the future of Iteratively?
Contact Info
Patrick
LinkedIn
@Patrickt010 on Twitter
Website
Ondrej
LinkedIn
@ondrej421 on Twitter
ondrej on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Iteratively
Syncplicity
Locally Optimistic
DBT
Podcast Episode
Snowplow Analytics
Podcast Episode
JSON Schema
Master Data Management
Podcast Episode
SDLC == Software Development Life Cycle
Amplitude
Mixpanel
Mode Analytics
CRUD == Create, Read, Update, Delete
Segment
Podcast Episode
SchemaVer (JSON Schema Versioning Strategy)
Great Expectations
Podcast.init Interview
Data Engineering Podcast Interview
Confluence
Notion
Confluent Schema Registry
Podcast Episode
Snowplow Iglu Schema Registry
Pulsar Schema Registry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 4, 2020 • 1h 1min
A Practical Introduction To Graph Data Applications
Summary
Finding connections between data and the entities that they represent is a complex problem. Graph data models and the applications built on top of them are perfect for representing relationships and finding emergent structures in your information. In this episode Denise Gosnell and Matthias Broecheler discuss their recent book, the Practitioner’s Guide To Graph Data, including the fundamental principles that you need to know about graph structures, the current state of graph support in database engines, tooling, and query languages, as well as useful tips on potential pitfalls when putting them into production. This was an informative and enlightening conversation with two experts on graph data applications that will help you start on the right track in your own projects.
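To make the graph-thinking discussion concrete: in a Gremlin-compatible database (Gremlin being one of the query languages that comes up in the episode), a relationship query is expressed as a traversal rather than a join. A minimal sketch using the gremlinpython driver, assuming a Gremlin Server at a placeholder address and a toy "person knows person" graph:

```python
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal

# Connect to a hypothetical Gremlin Server endpoint.
conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Friends-of-friends: who do Alice's acquaintances know? (two-hop traversal)
names = (
    g.V().has("person", "name", "Alice")
     .out("knows")
     .out("knows")
     .dedup()
     .values("name")
     .toList()
)
print(names)
conn.close()
```

The same question in a relational schema would require a self-join per hop, which is exactly the kind of access pattern that motivates a graph model.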
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Denise Gosnell and Matthias Broecheler about the recently published practitioner’s guide to graph data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what your goals are for the Practitioner’s Guide To Graph Data?
What was your motivation for writing a book to address this topic?
What do you see as the driving force behind the growing popularity of graph technologies in recent years?
What are some of the common use cases/applications of graph data and graph traversal algorithms?
What are the core elements of graph thinking that data teams need to be aware of to be effective in identifying those cases in their existing systems?
What are the fundamental principles of graph technologies that data engineers should be familiar with?
What are the core modeling principles that they need to know for designing schemas in a graph database?
Beyond databases, what are some of the other components of the data stack that can or should handle graphs natively?
Do you typically use a graph database as the primary or complementary data store?
What are some of the common challenges that you see when bringing graph applications to production?
What have you found to be some of the common points of confusion or error prone aspects of implementing and maintaining graph oriented applications?
When it comes to the specific technologies of different graph databases, what are some of the edge cases/variances in the interfaces or modeling capabilities that they present?
How does the variation in query languages impact the overall adoption of these technologies?
What are your thoughts on the recent standardization of GQL as an ANSI/ISO specification?
What are some of the scaling challenges that exist for graph data engines?
What are the ongoing developments/improvements/trends in graph technology that you are most excited about?
What are some of the shortcomings in existing technology/ecosystem for graph applications that you would like to see addressed?
What are some of the cases where a graph is the wrong abstraction for a data project?
What are some of the other resources that you recommend for anyone who wants to learn more about the various aspects of graph data?
Contact Info
Denise
LinkedIn
@DeniseKGosnell on Twitter
Matthias
LinkedIn
@MBroecheler on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The Practitioner’s Guide To Graph Data
Datastax
Titan graph database
Goethe
Graph Database
NoSQL
Relational Database
Elasticsearch
Podcast Episode
Associative Array Data Structure
RDF Triple
Datastax Multi-model Graph Database
Semantic Web
Gremlin Graph Query Language
Super Node
Neuromorphic Computing
Datastax Desktop
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 28, 2020 • 50min
Build More Reliable Distributed Systems By Breaking Them With Jepsen
Summary
A majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the Jepsen framework for testing the guarantees of distributed data processing systems and identifying when and why they break. In this episode he shares his approach to testing complex systems, the common challenges that are faced by engineers who build them, and why it is important to understand their limitations. This was a great look at some of the underlying principles that power your mission critical workloads.
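At its heart, a Jepsen analysis records a concurrent history of operations against a system and then checks whether that history satisfies a consistency model such as linearizability. Jepsen itself is written in Clojure and uses far more efficient checkers, but the core idea can be sketched in a few lines of Python: search for a total order of operations that respects real time and register semantics.

```python
from itertools import permutations

# Each operation: (invoke_time, complete_time, kind, value)
# kind "w" writes value; kind "r" is a read that observed value.
history = [
    (0, 3, "w", 1),  # write 1
    (1, 4, "r", 2),  # a concurrent read that saw 2
    (2, 5, "w", 2),  # a concurrent write of 2
]

def respects_real_time(order):
    # If b completed before a was invoked, b must be ordered before a.
    return not any(
        b[1] < a[0] for i, a in enumerate(order) for b in order[i + 1:]
    )

def register_consistent(order, initial=None):
    value = initial
    for _, _, kind, v in order:
        if kind == "w":
            value = v
        elif value != v:  # every read must return the latest write
            return False
    return True

def linearizable(history):
    return any(
        respects_real_time(o) and register_consistent(o)
        for o in permutations(history)
    )

print(linearizable(history))  # True: the order w1, w2, r2 explains this history
```

Real histories contain hundreds of operations, which is why Jepsen’s checkers rely on much smarter search than this factorial brute force.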
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Kyle Kingsbury about his work on the Jepsen testing framework and the failure modes of distributed systems
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the Jepsen project is?
What was your inspiration for starting the project?
What other methods are available for evaluating and stress testing distributed systems?
What are some of the common misconceptions or misunderstanding of distributed systems guarantees and how they impact real world usage of things like databases?
How do you approach the design of a test suite for a new distributed system?
What is your heuristic for determining the completeness of your test suite?
What are some of the common challenges of setting up a representative deployment for testing?
Can you walk through the workflow of setting up, running, and evaluating the output of a Jepsen test?
How is Jepsen implemented?
How has the design evolved since you first began working on it?
What are the pros and cons of using Clojure for building Jepsen?
If you were to start over today on the Jepsen framework what would you do differently?
What are some of the most common failure modes that you have identified in the platforms that you have tested?
What have you found to be the most difficult to resolve distributed systems bugs?
What are some of the interesting developments in distributed systems design that you are keeping an eye on?
How do you perceive the impact that Jepsen has had on modern distributed systems products?
What have you found to be the most interesting, unexpected, or challenging lessons learned while building Jepsen and evaluating mission critical systems?
What do you have planned for the future of the Jepsen framework?
Contact Info
aphyr on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Jepsen
Riak
Distributed Systems
TLA+
Coq
Isabelle
Cassandra DTest
FoundationDB
Podcast Episode
CRDT == Conflict-free Replicated Data-type
Podcast Episode
Riemann
Clojure
JVM == Java Virtual Machine
Kotlin
Haskell
Scala
Groovy
TiDB
YugabyteDB
Podcast Episode
CockroachDB
Podcast Episode
Raft consensus algorithm
Paxos
Leslie Lamport
Calvin
FaunaDB
Podcast Episode
Heidi Howard
CALM Conjecture
Causal Consistency
Hillel Wayne
Christopher Meiklejohn
Distsys Class
Distributed Systems For Fun And Profit by Mikito Takada
Christopher Meiklejohn Reading List
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 21, 2020 • 41min
Making Wind Energy More Efficient With Data At Turbit Systems
Summary
Wind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the overall efficiency of the turbines. Michael Tegtmeier founded Turbit Systems to help operators of wind farms identify and correct problems that contribute to suboptimal power outputs. In this episode he shares the story of how he got started working with wind energy, the system that he has built to collect data from the individual turbines, and how he is using machine learning to provide valuable insights to produce higher energy outputs. This was a great conversation about using data to improve the way the world works.
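A common way to spot turbine underperformance, and a drastically simplified version of what such monitoring involves, is to model the power curve from healthy SCADA history and flag readings that fall well below expectation. A toy sketch with made-up numbers (Turbit’s actual system uses neural networks over many more signals):

```python
import numpy as np

# Healthy SCADA history (invented numbers): wind speed in m/s vs power in kW.
wind_speed = np.array([4.1, 5.3, 6.8, 7.9, 9.2, 10.5])
power_kw = np.array([180, 420, 890, 1350, 1980, 2600])

# Fit a crude cubic power curve to the healthy data.
coeffs = np.polyfit(wind_speed, power_kw, deg=3)

def underperforming(ws, kw, tolerance=0.85):
    """Flag a reading that falls well below the expected power curve."""
    expected = np.polyval(coeffs, ws)
    return kw < tolerance * expected

print(underperforming(8.0, 900))   # far below the curve at 8 m/s -> True
print(underperforming(8.0, 1400))  # near the curve -> False
```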
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Michael Tegtmeier about Turbit, a machine learning powered platform for performance monitoring of wind farms
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Turbit and your motivation for creating the business?
What are the most problematic factors that contribute to low performance in power generation with wind turbines?
What is the current state of the art for accessing and analyzing data for wind farms?
What information are you able to gather from the SCADA systems in the turbine?
How uniform is the availability and formatting of data from different manufacturers?
How are you handling data collection for the individual turbines?
How much information are you processing at the point of collection vs. sending to a centralized data store?
Can you describe the system architecture of Turbit and the lifecycle of turbine data as it propagates from collection to analysis?
How do you incorporate domain knowledge into the identification of useful data and how it is used in the resultant models?
What are some of the most challenging aspects of building an analytics product for the wind energy sector?
What have you found to be the most interesting, unexpected, or challenging aspects of building and growing Turbit?
What do you have planned for the future of the technology and business?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Turbit Systems
LIDAR
Pulse Shaping
Wind Turbine
SCADA
Genetic Algorithm
Bremen Germany
Pitch
Yaw
Nacelle
Anemometer
Neural Network
Swarm64
Podcast Episode
Tensorflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 13, 2020 • 1h 5min
Open Source Production Grade Data Integration With Meltano
Summary
The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use open source option. The Meltano project is aiming to provide a solution to that situation. In this episode, project lead Douwe Maan shares the history of how Meltano got started, the motivation for the recent shift in focus, and how it is implemented. The Singer ecosystem has laid the groundwork for a great option to empower teams of all sizes to unlock the value of their data, and Meltano is building the remaining structure to make it a fully featured contender for proprietary systems.
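Meltano builds on the Singer specification, in which extractors (taps) and loaders (targets) are separate programs exchanging JSON messages over stdout/stdin. A minimal sketch of what a tap emits, per the published Singer spec (the stream name and fields are invented for illustration):

```python
import json
import sys
from datetime import datetime, timezone

def emit(message):
    """Singer messages are newline-delimited JSON on stdout."""
    sys.stdout.write(json.dumps(message) + "\n")

# Describe the stream before sending any records.
emit({
    "type": "SCHEMA",
    "stream": "users",
    "schema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
    },
    "key_properties": ["id"],
})

# Emit a record for that stream.
emit({
    "type": "RECORD",
    "stream": "users",
    "record": {"id": 1, "email": "someone@example.com"},
    "time_extracted": datetime.now(timezone.utc).isoformat(),
})

# Checkpoint progress so an interrupted sync can resume incrementally.
emit({"type": "STATE", "value": {"users": {"last_id": 1}}})
```

A target is simply another program that reads these lines from stdin and writes them to a destination; Meltano manages the configuration, plumbing, and orchestration around such pairs.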
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Douwe Maan about Meltano, an open source platform for building, running & orchestrating ELT pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Meltano is and the story behind it?
Who is the target audience?
How does the focus on small or early stage organizations constrain the architectural decisions that go into Meltano?
What have you found to be the complexities in trying to encapsulate the entirety of the data lifecycle in a single tool or platform?
What are the most painful transitions in that lifecycle and how does that pain manifest?
How and why has the focus of the project shifted from its original vision?
With your current focus on the data integration/data transfer stage of the lifecycle, what are you seeing as the biggest barriers to entry with the current ecosystem?
What are the main elements of your strategy to address these barriers?
How is the Meltano platform in its current incarnation implemented?
How much of the original architecture have you been able to retain, and how have you evolved it to align with your new direction?
What have you found to be the challenges that your users face when going from the easy on-ramp of local execution to then trying to scale and customize their pipelines for production use?
What are the most critical features that you are focusing on building now to make Meltano competitive with managed platforms?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Meltano?
When is Meltano the wrong choice?
What is your broad vision for the future of Meltano?
What are the most immediate needs for contribution that will help you realize that vision?
Contact Info
Website
DouweM on GitLab
DouweM on GitHub
@DouweM on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Meltano
GitLab
Mexico City
Netherlands
Locally Optimistic
Singer
Stitch Data
DBT
ELT
Informatica
Version Control
Code Review
CI/CD
Jupyter Notebook
LookML
Meltano Modeling Syntax
Redash
Metabase
Apache Superset
Apache Airflow
Luigi
Prefect
Dagster
Transferwise
Pipelinewise
12 Factor Application
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jul 6, 2020 • 46min
DataOps For Streaming Systems With Lenses.io
Summary
There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running you need a way to keep an eye on it, including observability, discovery, and governance of your data. That’s what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data requires a different approach than batch oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.
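The appeal of SQL as the common interface is that a query such as SELECT * FROM payments WHERE amount > 100 becomes a continuous computation over an unbounded stream rather than a one-shot scan. This toy Python generator mimics only those semantics; it is not Lenses’ engine, which parses real SQL and runs against Kafka:

```python
import json

# Toy equivalent of the continuous query:
#   SELECT * FROM payments WHERE amount > 100
# A streaming SQL engine compiles the predicate and applies it to every
# record as it arrives, forever, emitting matches downstream.
def streaming_where(records, predicate):
    for raw in records:  # in a real system this iterator never ends
        event = json.loads(raw)
        if predicate(event):
            yield event

payments = ['{"id": 1, "amount": 42}', '{"id": 2, "amount": 250}']
for match in streaming_where(payments, lambda e: e["amount"] > 100):
    print(match)  # {'id': 2, 'amount': 250}
```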
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Andrew Stevenson about Lenses.io, a platform to provide real-time data operations for engineers
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Lenses is and the story behind it?
What is your working definition for what constitutes DataOps?
How does the Lenses platform support the cross-cutting concerns that arise when trying to bridge the different roles in an organization to deliver value with data?
What are the typical barriers to collaboration, and how does Lenses help with that?
Many different systems provide a SQL interface to streaming data on various substrates. What was your reason for building your own SQL engine and what is unique about it?
What are the main challenges that you see engineers facing when working with streaming systems?
What have you found to be the most notable evolutions in the community and ecosystem around Kafka and streaming platforms?
One of the interesting features in the recent release is support for topologies to map out the relations between different producers and consumers across a stream. Why is that a difficult problem and how have you approached it?
On the point of monitoring, what are the foundational challenges that engineers run into when trying to gain visibility into streams of data?
What are some useful strategies for collecting and analyzing traces of data flows?
As with many things in the space of data, local development and pre-production testing and validation are complicated due to the potential scale and variability of a production system. What advice do you have for engineers who are trying to establish a sustainable workflow for streaming applications?
How do you facilitate the CI/CD process for enabling a culture of testing and establishing confidence in the correct functionality of your systems?
How is the Lenses platform implemented and how has its design evolved since you first began working on it?
What are some of the specifics of Kafka that you have had to reconsider or redesign as you began adding support for additional streaming engines (e.g. Redis and Pulsar)?
What are some of the most interesting, unexpected, or innovative ways that you have seen the Lenses platform used?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with Lenses?
When is Lenses the wrong choice?
What do you have planned for the future of the platform?
Contact Info
LinkedIn
@StevensonA_D on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Lenses.io
Babylon Health
DevOps
DataOps
GitOps
Apache Calcite
kSQL
Kafka Connect Query Language
Apache Flink
Podcast Episode
Apache Spark
Podcast Episode
Apache Pulsar
Podcast Episode
StreamNative Episode
Playtika
Riskfuel(?)
JMX Metrics
Amazon MSK (Managed Streaming for Kafka)
Prometheus
Canary Deployment
Kafka on Pulsar
Data Catalog
Data Mesh
Podcast Episode
Dagster
Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Jun 30, 2020 • 57min
Data Collection And Management To Power Sound Recognition At Audio Analytic
Summary
We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges that they face in the collection and labelling of high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.
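The need for custom metadata is concrete: every clip has to carry its label, provenance, and recording characteristics through the pipeline. A rough sketch of such a record using only Python’s standard library wave module (the field names and label taxonomy here are hypothetical, not Audio Analytic’s actual schema):

```python
import wave
from pathlib import Path

def describe_clip(path: Path, label: str, location: str) -> dict:
    """Build a metadata record to accompany one labelled audio sample."""
    with wave.open(str(path), "rb") as clip:
        frames = clip.getnframes()
        rate = clip.getframerate()
        channels = clip.getnchannels()
    return {
        "file": path.name,
        "label": label,                     # e.g. "glass_break" (hypothetical taxonomy)
        "recording_location": location,     # provenance metadata
        "duration_seconds": frames / rate,  # source samples vary in length
        "sample_rate_hz": rate,
        "channels": channels,
    }

record = describe_clip(Path("sample_0001.wav"), "dog_bark", "kitchen, 2m from source")
print(record)
```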
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career in data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Audio Analytic?
What was your motivation for building an AI platform for sound recognition?
What are some of the ways that your platform is being used?
What are the unique challenges that you have faced in working with arbitrary sound data?
How do you handle the collection and labelling of the source data that you rely on for building your models?
Beyond just collection and storage, what is your process for defining a taxonomy of the audio data that you are working with?
How has the taxonomy had to evolve, and what assumptions have had to change, as you progressed in building the data set and the resulting models?
What are the challenges of building an embeddable AI model?
How do you manage the update cycle for models deployed on devices?
How do you identify relevant audio and deal with literal noise in the input data?
What rights and ownership challenges do you face in the collection of source data?
What was your design process for constructing a pipeline for the audio data that you need to process?
Can you describe how your overall data management system is architected?
How has that architecture evolved since you first began building and using it?
A majority of data tools are oriented around, and optimized for, collection and processing of textual data. How much off-the-shelf technology have you been able to use for working with audio?
What are some of the assumptions that you made at the start which have been shown to be inaccurate or in need of reconsidering?
How do you address variability in the duration of source samples in the processing pipeline?
How much of an issue do you face as a result of the variable quality of microphones in the embedded devices where the model is being run?
What are the limitations of the model in dealing with complex and layered audio environments?
How has the testing and evaluation of your model fed back into your strategies for collecting source data?
What are some of the weirdest or most unusual sounds that you have worked with?
What have been the most interesting, unexpected, or challenging lessons that you have learned in the process of building the technology and business of Audio Analytic?
What do you have planned for the future of the company?
Contact Info
Chris
LinkedIn
Thomas
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Audio Analytic
Twitter
Anechoic Chamber
EXIF Data
ID3 Tags
Polyphonic Sound Detection Score
GitHub Repository
ICASSP
CES
M0+ ARM Processor
Context Systems Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast