Data Engineering Podcast

Tobias Macey
Feb 4, 2019 • 1h 1min

Cleaning And Curating Open Data For Archaeology

Summary Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain-specific knowledge, and how the information contained in connections that they expose is being used for interesting projects. Introduction Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data. Interview Introduction How did you get involved in the area of data management? I did some database and GIS work for my dissertation in archaeology, back in the late 1990s. 
I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies on research data management. Can you start by describing what Open Context is and how it started? Open Context is an open access data publishing service for archaeology. It started because we need better ways of disseminating structured data and digital media than is possible with conventional articles, books and reports. What are your protocols for determining which data sets you will work with? Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies. What are some of the challenges unique to research data? What are some of the unique requirements for processing, publishing, and archiving research data? You have to work on a shoestring budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices. Another issue is that it will take a long time to publish enough data to power many "meta-analyses" that draw upon many datasets. The issue is that lots of archaeological data describes very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So, we face a monumental task in supplying enough data to satisfy many, many particularistic interests. How much education is necessary around your content licensing for researchers who are interested in publishing their data with you? We require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand. Can you describe the system architecture that you use for Open Context? 
Open Context is a Django Python application, with a Postgres database and an Apache Solr index. It’s running on Google cloud services on Debian Linux. What is the process for cleaning and formatting the data that you host? How much domain expertise is necessary to ensure proper conversion of the source data? That’s one of the bottlenecks. We have to do an ETL (extract, transform, load) on each dataset researchers submit for publication. Each dataset may need lots of cleaning and back and forth conversations with data creators. Can you discuss the challenges that you face in maintaining a consistent ontology? What pieces of metadata do you track for a given data set? Can you speak to the average size of data sets that you manage and any approach that you use to optimize for cost of storage and processing capacity? Can you walk through the lifecycle of a given data set? Data archiving is a complicated and difficult endeavor due to issues pertaining to changing data formats and storage media, as well as repeatability of computing environments to generate and/or process them. Can you discuss the technical and procedural approaches that you take to address those challenges? Once the data is stored you expose it for public use via a set of APIs which support linked data. Can you discuss any complexities that arise from needing to identify and expose interrelations between the data sets? What are some of the most interesting uses you have seen of the data that is hosted on Open Context? What have been some of the most interesting/useful/challenging lessons that you have learned while working on Open Context? What are your goals for the future of Open Context? Contact Info @ekansa on Twitter LinkedIn ResearchGate Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links Open Context Bronze Age GIS (Geographic Information System) Filemaker Access Database Excel Creative Commons Open Context On Github Django PostgreSQL Apache Solr GeoJSON JSON-LD RDF OCHRE SKOS (Simple Knowledge Organization System) Django Reversion California Digital Library Zenodo CERN Digital Index of North American Archaeology (DINAA) Ansible Docker OpenRefine The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Jan 29, 2019 • 42min

Managing Database Access Control For Teams With strongDM

Summary Controlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage engines, but once either or both of those start to scale then things quickly become complex and difficult to manage. After years of running across the same issues in numerous companies and even more projects Justin McCarthy built strongDM to solve database access management for everyone. In this episode he explains how the strongDM proxy works to grant and audit access to storage systems and the benefits that it provides to engineers and team leads. Introduction Your host is Tobias Macey and today I’m interviewing Justin McCarthy about StrongDM, a hosted service that simplifies access controls for your data. Interview Introduction How did you get involved in the area of data management? Can you start by explaining the problem that StrongDM is solving and how the company got started? 
What are some of the most common challenges around managing access and authentication for data storage systems? What are some of the most interesting workarounds that you have seen? Which areas of authentication, authorization, and auditing are most commonly overlooked or misunderstood? Can you describe the architecture of your system? What strategies have you used to enable interfacing with such a wide variety of storage systems? What additional capabilities do you provide beyond what is natively available in the underlying systems? What are some of the most difficult aspects of managing varying levels of permission for different roles across the diversity of platforms that you support, given that they each have different capabilities natively? For a customer who is onboarding, what is involved in setting up your platform to integrate with their systems? What are some of the assumptions that you made about your problem domain and market when you first started which have been disproven? How do organizations in different industries react to your product and how do their policies around granting access to data differ? What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of building and growing StrongDM? Contact Info LinkedIn @justinm on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links StrongDM Authentication Vs. Authorization Hashicorp Vault Configuration Management Chef Puppet SaltStack Ansible Okta SSO (Single Sign On) SOC 2 Two Factor Authentication SSH (Secure SHell) RDP The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Jan 21, 2019 • 48min

Building Enterprise Big Data Systems At LEGO

Summary Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems. Preamble Your host is Tobias Macey and today I’m interviewing Keld Antonsen and Jesper Soegaard about the data infrastructure and analytics that powers LEGO. Interview Introduction How did you get involved in the area of data management? 
My understanding is that the big data group at LEGO is a fairly recent development. Can you share the story of how it got started? What kinds of data practices were in place prior to starting a dedicated group for managing the organization’s data? What was the transition process like, migrating data silos into a uniformly managed platform? What are the biggest data challenges that you face at LEGO? What are some of the most critical sources and types of data that you are managing? What are the main components of the data infrastructure that you have built to support the organization’s analytical needs? What are some of the technologies that you have found to be most useful? Which have been the most problematic? What does the team structure look like for the data services at LEGO? Does that reflect in the types/numbers of systems that you support? What types of testing, monitoring, and metrics do you use to ensure the health of the systems you support? What have been some of the most interesting, challenging, or useful lessons that you have learned while building and maintaining the data platforms at LEGO? How have the data systems at LEGO evolved over recent years as new technologies and techniques have been developed? How does the global nature of the LEGO business influence the design strategies and technology choices for your platform? What are you most excited for in the coming year? Contact Info Jesper LinkedIn Keld LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links LEGO Group ERP (Enterprise Resource Planning) Predictive Analytics Prescriptive Analytics Hadoop Center Of Excellence Continuous Integration Spark Podcast Episode Apache NiFi Podcast Episode The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Jan 14, 2019 • 41min

TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65

The TimescaleDB CEO and CTO talk about the 1.0 release, the increasing demand for time series databases, the distinctions between TimescaleDB and PipelineDB, the challenges in reaching the 1.0 release, the flexibility of TimescaleDB, and future plans for scaling and automation.
Jan 7, 2019 • 51min

Performing Fast Data Analytics Using Apache Kudu - Episode 64

Summary The Hadoop platform is purpose built for processing large, slow moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with a column oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into your analytics pipeline. Preamble Your host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem. Interview Introduction How did you get involved in the area of data management? 
Can you start by explaining what Kudu is and the motivation for building it? How does it fit into the Hadoop ecosystem? How does it compare to the work being done on the Iceberg table format? What are some of the common application and system design patterns that Kudu supports? How is Kudu architected and how has it evolved over the life of the project? There are many projects in and around the Hadoop ecosystem that rely on Zookeeper as a building block for consensus. What was the reasoning for using Raft in Kudu? How does the storage layer in Kudu differ from what would be found in systems like Hive or HBase? What are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance? A number of the projects built for large scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure? What was the motivation for using C++ as the language target for Kudu? If you were to start the project over today what would you do differently? What are some situations where you would advise against using Kudu? What have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu? What are you most excited about for the future of Kudu? Contact Info Brock LinkedIn @brocknoland on Twitter Jordan LinkedIn @jordanbirdsell jbirdsell on GitHub PhData Website phdata on GitHub @phdatainc on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links Kudu PhData Getting Started with Apache Kudu Thomson Reuters Hadoop Oracle Exadata Slowly Changing Dimensions HDFS S3 Azure Blob Storage State Farm Stanley Black & Decker ETL (Extract, Transform, Load) Parquet Podcast Episode ORC HBase Spark Podcast Episode Impala Netflix Iceberg Podcast Episode Hive ACID IOT (Internet Of Things) Streamsets NiFi Podcast Episode Kafka Connect Moore’s Law 3D XPoint Raft Consensus Algorithm STONITH (Shoot The Other Node In The Head) Yarn Cython Podcast.__init__ Episode Pandas Podcast.__init__ Episode Cloudera Manager Apache Sentry Collibra The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Dec 31, 2018 • 45min

Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63

Summary As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different from those of batch-oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuck explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly once processing and transactions. And if you listen at approximately the half-way mark, you can hear as the host’s mind is blown by the possibilities of treating everything, including schema information, as a stream. Preamble 
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat Your host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams Interview Introduction How did you get involved in the area of data management? Can you start by explaining what Pravega is and the story behind it? What are the use cases for Pravega and how does it fit into the data ecosystem? How does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data? How do you represent a stream on-disk? What are the benefits of using this format for persisted streams? One of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides? I am also intrigued by the automatic tiering of the persisted storage. How does that work and what options exist for managing the lifecycle of the data in the cluster? For someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward? What are some of the unique system design patterns that are made possible by Pravega? How is Pravega architected internally? What is involved in integrating engines such as Spark, Flink, or Storm with Pravega? A common challenge for streaming systems is exactly once semantics. How does Pravega approach that problem? Does it have any special capabilities for simplifying processing of out-of-order events? For someone planning a deployment of Pravega, what is involved in building and scaling a cluster? What are some of the operational edge cases that users should be aware of? What are some of the most interesting, useful, or challenging experiences that you have had while building Pravega? What are some cases where you would recommend against using Pravega? 
What is in store for the future of Pravega? Contact Info tkaitchuk on GitHub LinkedIn Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Pravega Amazon SQS (Simple Queue Service) Amazon Simple Workflow Service (SWF) Azure EMC Zookeeper Podcast Episode Bookkeeper Kafka Pulsar Podcast Episode RocksDB Flink Podcast Episode Spark Podcast Episode Heron Lambda Architecture Kappa Architecture Erasure Code Flink Forward Conference CAP Theorem The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Dec 24, 2018 • 1h 4min

Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62

Summary Processing high velocity time-series data in real-time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of. Preamble Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL. Interview Introduction How did you get involved in the area of data management? Can you start by explaining what PipelineDB is and the motivation for creating it? What are the major use cases that it enables? What are some example applications that are uniquely well suited to the capabilities of PipelineDB? What are the major concepts and components that users of PipelineDB should be familiar with? 
Given the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus? What are some of the common patterns for populating data streams? What are the options for scaling PipelineDB systems, both vertically and horizontally? How much elasticity does the system support in terms of changing volumes of inbound data? What are some of the limitations or edge cases that users should be aware of? Given that inbound data is not persisted to disk, how do you guard against data loss? Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location? Can a separate table be used as an input stream? Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views? What are some of the features that you have found to be the most useful which users might initially overlook? What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous? What are some of the most challenging aspects of building continuous aggregates on unbounded data? What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB? What are some of the most interesting or unexpected ways that you have seen PipelineDB used? When is PipelineDB the wrong choice? What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone? Contact Info Derek derekjn on GitHub LinkedIn Usman @usmanm on Twitter Website Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links PipelineDB Stride PostgreSQL Podcast Episode AdRoll Probabilistic Data Structures TimescaleDB Podcast Episode Hive Redshift Kafka Kinesis ZeroMQ Nanomsg HyperLogLog Bloom Filter The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Dec 17, 2018 • 39min

Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61

Summary Every business needs a pipeline for their critical data, even if it is just pasting into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines is used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform. Preamble Your host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows. Interview Introduction How did you get involved in the area of data management? Can you start by sharing your definition of a data pipeline? At what point in the life of a project or organization should you start thinking about building a pipeline? 
In the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline? What metrics/use cases should you be optimizing for at this point? What are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale? How do the design requirements for a data pipeline change as you reach this stage? What are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale? What are some of the changes that are necessary as you move to a large scale data pipeline? At each level of scale it is important to minimize the impact of the ETL process on the source systems. What are some strategies that you have employed to avoid degrading the performance of the application systems? In recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach? When performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect? Transformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes? What are your selection criteria when determining what workflow or ETL engines to use in your pipeline? How has your preference of build vs buy changed at different scales of operation and as new/different projects become available? What are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub? What are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen? What are your plans for improving your current pipeline at Grubhub? What are some references that you recommend for anyone who is designing a new data platform? 
Contact Info @sirchristian on Twitter Blog sirchristian on GitHub Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? Links Scaling ETL blog post GrubHub Data Warehouse Redshift Spark Spark In Action Podcast Episode Hive Amazon EMR Looker Podcast Episode Redash Metabase Podcast Episode A Primer on Enterprise Data Curation Pub/Sub (Publish-Subscribe Pattern) Change Data Capture Jenkins Python Azkaban Luigi Zendesk Data Lineage AirBnB Engineering Blog The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Dec 10, 2018 • 51min

Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60

Summary Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With the large array of capabilities, and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book for data engineers to hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape. Interview Introduction How did you get involved in the area of data management? Can you start by explaining what Spark is? What are some of the main use cases for Spark? 
What are some of the problems that Spark is uniquely suited to address? Who uses Spark? What are the tools offered to Spark users? How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm? For someone building on top of Spark what are the main software design paradigms? How does the design of an application change as you go from a local development environment to a production cluster? Once your application is written, what is involved in deploying it to a production environment? What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline? What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments? What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies? What are the limitations of the Spark programming model? What are the cases where Spark is the wrong choice? What was your motivation for writing a book about Spark? Who is the target audience? What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark? What advice do you have for anyone who is considering or currently using Spark? Contact Info @jgperrin on Twitter Blog Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Book Discount Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com. Links Apache Spark Spark In Action Book code examples in GitHub Informix International Informix Users Group MySQL Microsoft SQL Server ETL (Extract, Transform, Load) Spark SQL and Spark In Action’s chapter 11 Spark ML and Spark In Action’s chapter 18 Spark Streaming (structured) and Spark In Action’s chapter 10 Spark GraphX Hadoop Jupyter Podcast Interview Zeppelin Databricks IBM Watson Studio Kafka Flink Podcast Episode AWS Kinesis Yarn HDFS Hive Scala PySpark DAG Spark Catalyst Spark Tungsten Spark UDF AWS EMR Mesos DC/OS Kubernetes Dataframes The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
Dec 3, 2018 • 54min

Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59

Summary Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future. Preamble Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch. Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat. Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems. Interview Introduction How did you get involved in the area of data management? Can you start by explaining what Zookeeper is and how the project got started? What are the main motivations for using a centralized coordination service for distributed systems? What are the distributed systems primitives that are built into Zookeeper? 
What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper? What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper? Can you discuss how Zookeeper is architected and how that design has evolved over time? What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper? What are the scaling factors for Zookeeper? What are the edge cases that users should be aware of? Where does it fall on the axes of the CAP theorem? What are the main failure modes for Zookeeper? How much of the recovery logic is left up to the end user of the Zookeeper cluster? Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services? In recent years we have seen projects such as EtcD which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects? What are some of the cases where Zookeeper is the wrong choice? How have the needs of distributed systems engineers changed since you first began working on Zookeeper? If you were to start the project over today, what would you do differently? Would you still use Java? What are some of the most interesting or unexpected ways that you have seen Zookeeper used? What do you have planned for the future of Zookeeper? Contact Info @phunt on Twitter Parting Question From your perspective, what is the biggest gap in the tooling or technology for data management today? 
Links Zookeeper Cloudera Google Chubby Sourceforge HBase High Availability Fallacies of distributed computing Falsehoods programmers believe about networking Consul EtcD Apache Curator Raft Consensus Algorithm Zookeeper Atomic Broadcast SSD Write Cliff Apache Kafka Apache Flink Podcast Episode HDFS Kubernetes Netty Protocol Buffers Avro Rust The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA Support Data Engineering Podcast
