

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes of the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Dec 31, 2018 • 45min
Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63
Summary
As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different from those of batch-oriented workloads. To address this shortcoming, the team at Dell EMC created the open source Pravega project. In this episode Tom Kaitchuck explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly-once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the host’s mind is blown by the possibilities of treating everything, including schema information, as a stream.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show, please leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Pravega is and the story behind it?
What are the use cases for Pravega and how does it fit into the data ecosystem?
How does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data?
How do you represent a stream on-disk?
What are the benefits of using this format for persisted streams?
One of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides?
I am also intrigued by the automatic tiering of the persisted storage. How does that work and what options exist for managing the lifecycle of the data in the cluster?
For someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward?
What are some of the unique system design patterns that are made possible by Pravega?
How is Pravega architected internally?
What is involved in integrating engines such as Spark, Flink, or Storm with Pravega?
A common challenge for streaming systems is exactly once semantics. How does Pravega approach that problem?
Does it have any special capabilities for simplifying processing of out-of-order events?
For someone planning a deployment of Pravega, what is involved in building and scaling a cluster?
What are some of the operational edge cases that users should be aware of?
What are some of the most interesting, useful, or challenging experiences that you have had while building Pravega?
What are some cases where you would recommend against using Pravega?
What is in store for the future of Pravega?
Contact Info
tkaitchuk on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pravega
Amazon SQS (Simple Queue Service)
Amazon Simple Workflow Service (SWF)
Azure
EMC
Zookeeper
Podcast Episode
Bookkeeper
Kafka
Pulsar
Podcast Episode
RocksDB
Flink
Podcast Episode
Spark
Podcast Episode
Heron
Lambda Architecture
Kappa Architecture
Erasure Code
Flink Forward Conference
CAP Theorem
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 24, 2018 • 1h 4min
Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62
Summary
Processing high-velocity time-series data in real time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.
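For a flavor of the continuous query model discussed in the episode, here is a minimal sketch that creates a stream and a continuous view and writes a few events to it from Python. It uses the pre-1.0 CREATE STREAM / CREATE CONTINUOUS VIEW syntax and the psycopg2 driver; the connection string, stream, and column names are placeholders, and releases that ship PipelineDB as a PostgreSQL extension use slightly different DDL.

```python
# Minimal sketch of PipelineDB's continuous-query model (pre-1.0 syntax).
# Connection details and object names are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=pipeline user=postgres host=localhost")
conn.autocommit = True
cur = conn.cursor()

# A stream is a write-only relation; raw events are not persisted to disk.
cur.execute("CREATE STREAM page_views (url text, latency_ms int)")

# A continuous view incrementally maintains an aggregate over the stream.
cur.execute("""
    CREATE CONTINUOUS VIEW view_counts AS
    SELECT url, count(*) AS views, avg(latency_ms) AS avg_latency
    FROM page_views
    GROUP BY url
""")

# Writes to the stream update the continuous view in real time.
cur.execute("INSERT INTO page_views (url, latency_ms) VALUES (%s, %s)", ("/home", 42))

# Reading the view returns the current aggregate state.
cur.execute("SELECT * FROM view_counts")
print(cur.fetchall())
```

Because only the aggregate state is stored, the view stays small no matter how many raw events flow through the stream.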
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what PipelineDB is and the motivation for creating it?
What are the major use cases that it enables?
What are some example applications that are uniquely well suited to the capabilities of PipelineDB?
What are the major concepts and components that users of PipelineDB should be familiar with?
Given the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?
What are some of the common patterns for populating data streams?
What are the options for scaling PipelineDB systems, both vertically and horizontally?
How much elasticity does the system support in terms of changing volumes of inbound data?
What are some of the limitations or edge cases that users should be aware of?
Given that inbound data is not persisted to disk, how do you guard against data loss?
Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?
Can a separate table be used as an input stream?
Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?
What are some of the features that you have found to be the most useful which users might initially overlook?
What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?
What are some of the most challenging aspects of building continuous aggregates on unbounded data?
What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?
What are some of the most interesting or unexpected ways that you have seen PipelineDB used?
When is PipelineDB the wrong choice?
What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?
Contact Info
Derek
derekjn on GitHub
LinkedIn
Usman
@usmanm on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
PipelineDB
Stride
PostgreSQL
Podcast Episode
AdRoll
Probabilistic Data Structures
TimescaleDB
Podcast Episode
Hive
Redshift
Kafka
Kinesis
ZeroMQ
Nanomsg
HyperLogLog
Bloom Filter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 17, 2018 • 39min
Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61
Summary
Every business needs a pipeline for its critical data, even if that pipeline is just pasting data into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines is used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of a data pipeline?
At what point in the life of a project or organization should you start thinking about building a pipeline?
In the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?
What metrics/use cases should you be optimizing for at this point?
What are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?
How do the design requirements for a data pipeline change as you reach this stage?
What are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?
What are some of the changes that are necessary as you move to a large scale data pipeline?
At each level of scale it is important to minimize the impact of the ETL process on the source systems. What are some strategies that you have employed to avoid degrading the performance of the application systems?
In recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?
When performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?
Transformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?
What are your selection criteria when determining what workflow or ETL engines to use in your pipeline?
How has your preference of build vs buy changed at different scales of operation and as new/different projects become available?
What are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?
What are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?
What are your plans for improving your current pipeline at Grubhub?
What are some references that you recommend for anyone who is designing a new data platform?
Contact Info
@sirchristian on Twitter
Blog
sirchristian on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Scaling ETL blog post
GrubHub
Data Warehouse
Redshift
Spark
Spark In Action Podcast Episode
Hive
Amazon EMR
Looker
Podcast Episode
Redash
Metabase
Podcast Episode
A Primer on Enterprise Data Curation
Pub/Sub (Publish-Subscribe Pattern)
Change Data Capture
Jenkins
Python
Azkaban
Luigi
Zendesk
Data Lineage
AirBnB Engineering Blog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 10, 2018 • 51min
Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60
Summary
Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With the large array of capabilities and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book to help data engineers hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.
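As a taste of the programming model the book covers, here is a minimal PySpark sketch that loads a CSV into a DataFrame and runs an aggregation; the file path and column names are placeholders, and it assumes a local installation with the pyspark package available.

```python
# Minimal PySpark sketch: load a CSV into a DataFrame and aggregate it.
# The file path and column names are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("spark-in-action-sketch")
         .master("local[*]")  # run locally; on a cluster this comes from spark-submit
         .getOrCreate())

orders = (spark.read
          .option("header", "true")
          .option("inferSchema", "true")
          .csv("orders.csv"))

# Transformations are built lazily into a DAG and only run when an action is called.
totals = (orders
          .groupBy("customer_id")
          .agg(F.sum("amount").alias("total_spent"))
          .orderBy(F.desc("total_spent")))

totals.show(10)
spark.stop()
```

The same code runs unchanged on a cluster; only the master configuration and how you submit the job change.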
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Spark is?
What are some of the main use cases for Spark?
What are some of the problems that Spark is uniquely suited to address?
Who uses Spark?
What are the tools offered to Spark users?
How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
For someone building on top of Spark what are the main software design paradigms?
How does the design of an application change as you go from a local development environment to a production cluster?
Once your application is written, what is involved in deploying it to a production environment?
What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
What are the limitations of the Spark programming model?
What are the cases where Spark is the wrong choice?
What was your motivation for writing a book about Spark?
Who is the target audience?
What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
What advice do you have for anyone who is considering or currently using Spark?
Contact Info
@jgperrin on Twitter
Blog
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Book Discount
Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com
Links
Apache Spark
Spark In Action
Book code examples in GitHub
Informix
International Informix Users Group
MySQL
Microsoft SQL Server
ETL (Extract, Transform, Load)
Spark SQL and Spark In Action’s chapter 11
Spark ML and Spark In Action’s chapter 18
Spark Streaming (structured) and Spark In Action’s chapter 10
Spark GraphX
Hadoop
Jupyter
Podcast Interview
Zeppelin
Databricks
IBM Watson Studio
Kafka
Flink
Podcast Episode
AWS Kinesis
Yarn
HDFS
Hive
Scala
PySpark
DAG
Spark Catalyst
Spark Tungsten
Spark UDF
AWS EMR
Mesos
DC/OS
Kubernetes
Dataframes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Dec 3, 2018 • 54min
Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59
Summary
Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and etcd, and what is in store for the future.
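To make the coordination primitives concrete, here is a short sketch using the kazoo Python client (not covered in the episode) to register an ephemeral node, watch for membership changes, and take a distributed lock; the host address and znode paths are placeholders.

```python
# Sketch of two common Zookeeper primitives via the kazoo client:
# ephemeral nodes for service registration and a distributed lock.
# Host and znode paths are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Ephemeral znodes disappear automatically when this session ends,
# which is the basis for liveness and membership tracking.
zk.ensure_path("/services/reporting")
zk.create("/services/reporting/worker-", b"10.0.0.5:8080",
          ephemeral=True, sequence=True)

# Watches let clients react to membership changes.
children = zk.get_children("/services/reporting",
                           watch=lambda event: print("membership changed:", event))
print("current workers:", children)

# A higher-level recipe: a distributed lock built on sequential ephemeral nodes.
lock = zk.Lock("/locks/nightly-job", "worker-1")
with lock:  # blocks until the lock is acquired
    print("holding the lock, doing exclusive work")

zk.stop()
```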
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Zookeeper is and how the project got started?
What are the main motivations for using a centralized coordination service for distributed systems?
What are the distributed systems primitives that are built into Zookeeper?
What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?
What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?
Can you discuss how Zookeeper is architected and how that design has evolved over time?
What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
What are the scaling factors for Zookeeper?
What are the edge cases that users should be aware of?
Where does it fall on the axes of the CAP theorem?
What are the main failure modes for Zookeeper?
How much of the recovery logic is left up to the end user of the Zookeeper cluster?
Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
In recent years we have seen projects such as etcd (used by Kubernetes) and Consul. How does Zookeeper compare with those projects?
What are some of the cases where Zookeeper is the wrong choice?
How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
If you were to start the project over today, what would you do differently?
Would you still use Java?
What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
What do you have planned for the future of Zookeeper?
Contact Info
@phunt on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Zookeeper
Cloudera
Google Chubby
Sourceforge
HBase
High Availability
Fallacies of distributed computing
Falsehoods programmers believe about networking
Consul
etcd
Apache Curator
Raft Consensus Algorithm
Zookeeper Atomic Broadcast
SSD Write Cliff
Apache Kafka
Apache Flink
Podcast Episode
HDFS
Kubernetes
Netty
Protocol Buffers
Avro
Rust
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 26, 2018 • 39min
Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58
Summary
When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this situation more manageable, and to allow everyone in the business to gain value from the data, the folks at Dremio built a self-service data platform. In this episode Tomer Shiran, CEO and co-founder of Dremio, explains how it fits into the modern data landscape, how it works under the hood, and how you can start using it today to make your life easier.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tomer Shiran about Dremio, the open source data as a service platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Dremio is and how the project and business got started?
What was the motivation for keeping your primary product open source?
What is the governance model for the project?
How does Dremio fit in the current landscape of data tools?
What are some use cases that Dremio is uniquely equipped to support?
Do you think that Dremio obviates the need for a data warehouse or large scale data lake?
How is Dremio architected internally?
How has that architecture evolved from when it was first built?
There are a large array of components (e.g. governance, lineage, catalog) built into Dremio that are often found in dedicated products. What are some of the strategies that you have as a business and development team to manage and integrate the complexity of the product?
What are the benefits of integrating all of those capabilities into a single system?
What are the drawbacks?
One of the useful features of Dremio is the granular access controls. Can you discuss how those are implemented and controlled?
For someone who is interested in deploying Dremio to their environment what is involved in getting it installed?
What are the scaling factors?
What are some of the most exciting features that have been added in recent releases?
When is Dremio the wrong choice?
What have been some of the most challenging aspects of building, maintaining, and growing the technical and business platform of Dremio?
What do you have planned for the future of Dremio?
Contact Info
Tomer
@tshiran on Twitter
LinkedIn
Dremio
Website
@dremio on Twitter
dremio on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Dremio
MapR
Presto
Business Intelligence
Arrow
Tableau
Power BI
Jupyter
OLAP Cube
Apache Foundation
Hadoop
Nikon DSLR
Spark
ETL (Extract, Transform, Load)
Parquet
Avro
K8s
Helm
Yarn
Gandiva Initiative for Apache Arrow
LLVM
TLS
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 19, 2018 • 48min
Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57
Summary
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
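Flink’s primary APIs are Java and Scala, but as a rough sketch of the keyed processing model Fabian describes, here is a word-count style job written against the PyFlink DataStream API (which arrived in Flink releases newer than the one discussed here); the in-memory source and element values are placeholders.

```python
# Sketch of Flink's keyed streaming model using the PyFlink DataStream API.
# The in-memory source and element values are placeholders; a real job would
# read from Kafka, Pravega, files, etc.
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()

events = env.from_collection(
    [("user_a", 1), ("user_b", 1), ("user_a", 1)],
    type_info=Types.TUPLE([Types.STRING(), Types.INT()]))

# key_by partitions the stream; reduce maintains per-key state that Flink
# checkpoints for fault tolerance.
counts = (events
          .key_by(lambda e: e[0])
          .reduce(lambda a, b: (a[0], a[1] + b[1])))

counts.print()
env.execute("keyed-count-sketch")
```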
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Flink is and how the project got started?
What are some of the primary ways that Flink is used?
How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?
What are some use cases that Flink is uniquely qualified to handle?
Where does Flink fit into the current data landscape?
How is Flink architected?
How has that architecture evolved?
Are there any aspects of the current design that you would do differently if you started over today?
How does scaling work in a Flink deployment?
What are the scaling limits?
What are some of the failure modes that users should be aware of?
How is the statefulness of a cluster managed?
What are the mechanisms for managing conflicts?
What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
Can state be shared across processes or tasks within a Flink cluster?
What are the comparative challenges of working with bounded vs unbounded streams of data?
How do you handle out of order events in Flink, especially as the delay for a given event increases?
For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
What are some of the most challenging or complicated aspects of building and maintaining Flink?
What are some of the most interesting or unexpected ways that you have seen Flink used?
What are some of the improvements or new features that are planned for the future of Flink?
What are some features or use cases that you are explicitly not planning to support?
For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?
What do they find most interesting or exciting?
Contact Info
LinkedIn
@fhueske on Twitter
fhueske on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Flink
Data Artisans
IBM
DB2
Technische Universität Berlin
Hadoop
Relational Database
Google Cloud Dataflow
Spark
Cascading
Java
RocksDB
Flink Checkpoints
Flink Savepoints
Kafka
Pulsar
Storm
Scala
LINQ (Language INtegrated Query)
SQL
Backpressure
Watermarks
HDFS
S3
Avro
JSON
Hive Metastore
Dell EMC
Pravega
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 11, 2018 • 52min
How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56
Summary
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Upsolver is and how it got started?
What are your goals for the platform?
There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?
What are the shortcomings of a data lake architecture?
How is Upsolver architected?
How has that architecture changed over time?
How do you manage schema validation for incoming data?
What would you do differently if you were to start over today?
What are the biggest challenges at each of the major stages of the data lake?
What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
When is Upsolver the wrong choice for an organization considering implementation of a data platform?
Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?
What features or improvements do you have planned for the future of Upsolver?
Contact Info
Yoni
yoniiny on GitHub
LinkedIn
Upsolver
Website
@upsolver on Twitter
LinkedIn
Facebook
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Upsolver
Data Lake
Israeli Army
Data Warehouse
Data Engineering Podcast Episode About Data Curation
Three Vs
Kafka
Spark
Presto
Drill
Spot Instances
Object Storage
Cassandra
Redis
Latency
Avro
Parquet
ORC
Data Engineering Podcast Episode About Data Serialization Formats
SSTables
Run Length Encoding
CSV (Comma Separated Values)
Protocol Buffers
Kinesis
ETL
DevOps
Prometheus
Cloudwatch
DataDog
InfluxDB
SQL
Pandas
Confluent
KSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Nov 5, 2018 • 58min
Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55
Summary
Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Looker is and the problem that it is aiming to solve?
How do you define business intelligence?
How is Looker unique from other approaches to business intelligence in the enterprise?
How does it compare to open source platforms for BI?
Can you describe the technical infrastructure that supports Looker?
Given that you are connecting to the customer’s data store, how do you ensure sufficient security?
For someone who is using Looker, what does their workflow look like?
How does that change for different user roles (e.g. data engineer vs. sales management)?
What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency?
What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?
What are the portions of the Looker architecture that you would do differently if you were to start over today?
What are some of the most interesting or unusual uses of Looker that you have seen?
What is in store for the future of Looker?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Looker
Upworthy
MoveOn.org
LookML
SQL
Business Intelligence
Data Warehouse
Linux
Hadoop
BigQuery
Snowflake
Redshift
DB2
PostGres
ETL (Extract, Transform, Load)
ELT (Extract, Load, Transform)
Airflow
Luigi
NiFi
Data Curation Episode
Presto
Hive
Athena
DRY (Don’t Repeat Yourself)
Looker Action Hub
Salesforce
Marketo
Twilio
Netscape Navigator
Dynamic Pricing
Survival Analysis
DevOps
BigQuery ML
Snowflake Data Sharehouse
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 29, 2018 • 41min
Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54
Summary
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.
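The parameterized execution described in this episode is what the open source Papermill library provides. Here is a minimal sketch of invoking a notebook the way a scheduler might; the notebook paths and parameter values are placeholders for whatever your notebook’s parameters cell expects.

```python
# Minimal sketch of running a parameterized Jupyter notebook with Papermill.
# Paths and parameters are placeholders for whatever your notebook expects.
import papermill as pm

pm.execute_notebook(
    "templates/daily_report.ipynb",         # source notebook with a "parameters" cell
    "runs/daily_report_2018-12-01.ipynb",   # output notebook, kept as a run record
    parameters={
        "run_date": "2018-12-01",
        "region": "us-east-1",
    },
)
```

Every run writes a new output notebook, so the executed notebook doubles as a record of exactly what the job did.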
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
Where are you using notebooks and where are you not?
What is the technical infrastructure that you have built to support that design choice?
Which team was driving the effort?
Was it difficult to get buy in across teams?
How much shared code have you been able to consolidate or reuse across teams/roles?
Have you investigated the use of any of the other notebook platforms for similar workflows?
What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?
What are some of the limitations of the notebook environment for the work that you are doing?
What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?
What are some of the projects that are ongoing or planned for the future that you are most excited by?
Contact Info
Matthew Seal
Email
LinkedIn
@codeseal on Twitter
MSeal on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Netflix Notebook Blog Posts
Nteract Tooling
OpenGov
Project Jupyter
Zeppelin Notebooks
Papermill
Titus
Commuter
Scala
Python
R
Emacs
NBDime
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast


