

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Apr 14, 2020 • 26min
Making Data Collection In Your Code Easy With Rookout
Summary
The software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the data collection process from the lifecycle of your code. In this episode, CTO Liran Haimovitch discusses the benefits of shortening the iteration cycle and bringing non-engineers into the process of identifying useful data. This was a great conversation about the importance of democratizing the work of data collection.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Your host is Tobias Macey and today I’m interviewing Liran Haimovitch, CTO of Rookout, about the business value of operations metrics and other dark data in your organization
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the types of data that we typically collect in a systems operations context?
What are some of the business questions that can be answered from these data sources?
What are some of the considerations that developers and operations engineers need to be aware of when they are defining the collection points for system metrics and log messages?
What are some effective strategies that you have found for including business stakeholders in the process of defining these collection points?
One of the difficulties in building useful analyses from any source of data is maintaining the appropriate context. What are some of the necessary metadata that should be maintained along with operational metrics?
What are some of the shortcomings in the systems we design and use for operational data stores in terms of making the collected data useful for other purposes?
How does the existing tooling need to be changed or augmented to simplify the collaboration between engineers and stakeholders for defining and collecting the needed information?
The types of systems that we use for collecting and analyzing operations metrics are often designed and optimized for different access patterns and data formats than those used for analytical and exploratory purposes. What are your thoughts on how to incorporate the collected metrics with behavioral data?
What are some of the other sources of dark data that we should keep an eye out for in our organizations?
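One of the questions above concerns the metadata that should travel with operational metrics so that they remain useful for later business analysis. As a minimal sketch of what that can look like in practice (all field names here are invented for illustration, not Rookout's format):

```python
import json
import time

def emit_metric(name, value, **context):
    """Serialize a metric point with its contextual metadata attached."""
    point = {"metric": name, "value": value, "timestamp": time.time(), **context}
    return json.dumps(point)

# Attaching service, version, region, and deployment context at collection
# time is what lets a later analysis slice the metric by those dimensions.
line = emit_metric(
    "checkout.latency_ms", 182,
    service="checkout", version="1.4.2",
    region="us-east", deployment="canary",
)
print(line)
```

The point is that context recorded at the moment of collection is cheap, while reconstructing it after the fact is often impossible.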
Contact Info
LinkedIn
@Liran_Last on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Rookout
Cybersecurity
DevOps
DataDog
Graphite
Elasticsearch
Logz.io
Kafka
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 7, 2020 • 45min
Building A Knowledge Graph Of Commercial Real Estate At Cherre
Summary
Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources of information. In this episode John Maiden talks about how Cherre builds knowledge graphs that provide powerful insights for their customers and the engineering challenges of building a scalable graph. If you’re wondering how to extract additional business value from existing data, this episode will provide a way to expand your data resources.
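The relationship-centric organization described above can be sketched as a minimal in-memory triple store. The entity and relation names below are invented for illustration; they are not Cherre's actual schema:

```python
from collections import defaultdict

class TripleStore:
    """Toy knowledge graph storing (subject, predicate, object) triples."""

    def __init__(self):
        self.by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        self.by_subject[subject].add((predicate, obj))

    def neighbors(self, subject, predicate=None):
        """Entities related to `subject`, optionally filtered by predicate."""
        return [o for (p, o) in self.by_subject[subject]
                if predicate is None or p == predicate]

graph = TripleStore()
graph.add("123 Main St", "owned_by", "Acme Holdings LLC")
graph.add("456 Oak Ave", "owned_by", "Acme Holdings LLC")
graph.add("Acme Holdings LLC", "registered_to", "Jane Doe")

# Two properties are connected through a shared owner -- the kind of
# multi-hop relationship that is awkward to surface from flat tables.
owner = graph.neighbors("123 Main St", "owned_by")[0]
print(graph.neighbors(owner))
```

Production graph databases like DGraph or Neo4J provide the same relational emphasis with indexing, query languages, and scale that a sketch like this omits.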
Announcements
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include ODSC East which has gone virtual starting April 16th. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing John Maiden about how Cherre is building and using a knowledge graph of commercial real estate information
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Cherre is and the role that data plays in the business?
What are the benefits of a knowledge graph for making real estate investment decisions?
What are the main ways that you and your customers are using the knowledge graph?
What are some of the challenges that you face in providing a usable interface for end-users to query the graph?
What technology are you using for storing and processing the graph?
What challenges do you face in scaling the complexity and analysis of the graph?
What are the main sources of data for the knowledge graph?
What are some of the ways that messiness manifests in the data that you are using to populate the graph?
How are you managing cleaning of the data and how do you identify and process records that can’t be coerced into the desired structure?
How do you handle missing attributes or extra attributes in a given record?
How did you approach the process of determining an effective taxonomy for records in the graph?
What is involved in performing entity extraction on your data?
What are some of the most interesting or unexpected questions that you have been able to ask and answer with the graph?
What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with this data?
What are some of the near and medium term improvements that you have planned for your knowledge graph?
What advice do you have for anyone who is interested in building a knowledge graph of their own?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Cherre
Commercial Real Estate
Knowledge Graph
RDF Triple
DGraph
Podcast Interview
Neo4J
TigerGraph
Google BigQuery
Apache Spark
Spark In Action Episode
Entity Extraction/Named Entity Recognition
NetworkX
Spark Graph Frames
Graph Embeddings
Airflow
Podcast.__init__ Interview
DBT
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 30, 2020 • 45min
The Life Of A Non-Profit Data Professional
Summary
Building and maintaining a system that integrates and analyzes all of the data for your organization is a complex endeavor. Operating on a shoe-string budget makes it even more challenging. In this episode Tyler Colby shares his experiences working as a data professional in the non-profit sector. From managing Salesforce data models to wrangling a multitude of data sources and compliance challenges, he describes the biggest challenges that he is facing.
Announcements
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include the Observe 20/20 virtual conference and ODSC East which has also gone virtual. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tyler Colby about his experiences working as a data professional in the non-profit arena, most recently at the Natural Resources Defense Council
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing your responsibilities as the director of data infrastructure at the NRDC?
What specific challenges are you facing at the NRDC?
Can you describe some of the types of data that you are working with at the NRDC?
What types of systems are you relying on for the source of your data?
What kinds of systems have you put in place to manage the data needs of the NRDC?
What are your biggest influences in the build vs. buy decisions that you make?
What heuristics or guidelines do you rely on for aligning your work with the business value that it will produce and the broader mission of the organization?
Have you found there to be any extra scrutiny of your work as a member of a non-profit in terms of regulations or compliance questions?
Your career has involved a significant focus on the Salesforce platform. For anyone not familiar with it, what benefits does it provide in managing information flows and analysis capabilities?
What are some of the most challenging or complex aspects of working with Salesforce?
In light of the current global crisis posed by COVID-19 you have established a new non-profit entity to organize the efforts of various technical professionals. Can you describe the nature of that mission?
What are some of the unique data challenges that you anticipate or have already encountered?
How do the data challenges of this new organization compare to your past experiences?
What have you found to be most useful or beneficial in the current landscape of data management systems and practices in your career with non-profit organizations?
What are the areas that need to be addressed or improved for workers in the non-profit sector?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
NRDC
AWS Redshift
Time Warner Cable
Salesforce
Cloud For Good
Tableau
Civis Analytics
EveryAction
BlackBaud
ActionKit
MobileCommons
XKCD 1667
GDPR == General Data Protection Regulation
CCPA == California Consumer Privacy Act
Salesforce Apex
Salesforce.org
Salesforce Non-Profit Success Pack
Validity
OpenRefine
JitterBit
Skyvia
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 23, 2020 • 36min
Behind The Scenes Of The Linode Object Storage Service
Summary
There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the journey that he and his team at Linode recently completed to bring a fast and reliable S3-compatible object storage service to production for your benefit. He discusses the challenges of running object storage for public usage, some of the interesting ways that it was stress tested internally, and the lessons that he learned along the way.
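The S3 API that the episode treats as a de facto standard boils down to a small set of verbs. A toy in-memory version (an illustration of the API surface only, not Linode's Ceph-backed implementation) makes that shape clear:

```python
class Bucket:
    """In-memory sketch of the core S3-style object storage operations."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put_object(self, key, body: bytes):
        self._objects[key] = body

    def get_object(self, key) -> bytes:
        return self._objects[key]

    def list_objects(self, prefix=""):
        # S3 buckets are flat key spaces; "directories" are just key prefixes.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete_object(self, key):
        self._objects.pop(key, None)

bucket = Bucket("demo")
bucket.put_object("logs/2020-03-23.txt", b"hello")
bucket.put_object("logs/2020-03-24.txt", b"world")
print(bucket.list_objects(prefix="logs/"))
```

Because so many clients speak only these verbs, any provider that implements them, whether backed by Ceph, MinIO, or custom storage, can slot in behind existing tooling.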
Announcements
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Will Smith about his work on building object storage for the Linode cloud platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the current state of your object storage product?
What was the motivating factor for building and managing your own object storage system rather than building an integration with another offering such as Wasabi or Backblaze?
What is the scale and scope of usage that you had to design for?
Can you describe how your platform is implemented?
What was your criteria for deciding whether to use an available platform such as Ceph or MinIO vs building your own from scratch?
How have your initial assumptions about the operability and maintainability of your installation been challenged or updated since it has been released to the public?
What have been the biggest challenges that you have faced in designing and deploying a system that can meet the scale and reliability requirements of Linode?
What are the most important capabilities for the underlying hardware that you are running on?
What supporting systems and tools are you using to manage the availability and durability of your object storage?
How did you approach the rollout of Linode’s object storage to gain the confidence that you needed to feel comfortable with full scale usage?
What are some of the benefits that you have gained internally at Linode from having an object storage system available to your product teams?
What are your thoughts on the state of the S3 API as a de facto standard for object storage?
What is your main focus now that object storage is being rolled out to more data centers?
Contact Info
Dorthu on GitHub
dorthu22 on Twitter
LinkedIn
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Linode Object Storage
Xen Hypervisor
KVM (Linux Kernel Virtual Machine)
Linode API V4
Ceph Distributed Filesystem
Podcast Episode
Wasabi
Backblaze
MinIO
CERN Ceph Scaling Paper
RADOS Gateway
OpenResty
Lua
Prometheus
Linode Managed Kubernetes
Ceph Swift Protocol
Ceph Bug Tracker
Linode Dashboard Application Source Code
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 17, 2020 • 55min
Building A New Foundation For CouchDB
Summary
CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP interface it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt, which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission-critical project and the work being done to evolve it.
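The synchronization protocol mentioned above builds on CouchDB's document-revision model: every write must present the current revision token or it is rejected as a conflict. The sketch below is a heavy simplification (real CouchDB revisions are generation-hash pairs and conflicts are tracked for replication rather than simply refused), but it shows the core idea:

```python
import uuid

class ConflictError(Exception):
    pass

class DocStore:
    """Toy sketch of CouchDB-style optimistic concurrency control."""

    def __init__(self):
        self._docs = {}  # doc_id -> (rev, body)

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        # A write against anything but the latest revision is a conflict.
        if current is not None and current[0] != rev:
            raise ConflictError(f"revision mismatch for {doc_id}")
        generation = 1 if current is None else int(current[0].split("-")[0]) + 1
        new_rev = f"{generation}-{uuid.uuid4().hex[:8]}"
        self._docs[doc_id] = (new_rev, body)
        return new_rev

    def get(self, doc_id):
        return self._docs[doc_id]

store = DocStore()
rev1 = store.put("user:1", {"name": "Ada"})
rev2 = store.put("user:1", {"name": "Ada Lovelace"}, rev=rev1)
print(rev1, "->", rev2)
```

It is this revision bookkeeping that lets two disconnected replicas each accept writes and later reconcile them, which is why the choice of storage layer underneath it matters so much.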
Announcements
Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Adam Kocoloski about CouchDB and the work being done to migrate the storage layer to FoundationDB
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what CouchDB is?
How did you get involved in the CouchDB project and what is your current role in the community?
What are the use cases that it is well suited for?
Can you share some of the history of CouchDB and its role in the NoSQL movement?
How is CouchDB currently architected and how has it evolved since it was first introduced?
What have been the benefits and challenges of Erlang as the runtime for CouchDB?
How is the current storage engine implemented and what are its shortcomings?
What problems are you trying to solve by replatforming on a new storage layer?
What were the selection criteria for the new storage engine and how did you structure the decision making process?
What was the motivation for choosing FoundationDB as opposed to other options such as rocksDB, levelDB, etc.?
How is the adoption of FoundationDB going to impact the overall architecture and implementation of CouchDB?
How will the use of FoundationDB impact the way that the current capabilities are implemented, such as data replication?
What will the migration path be for people running an existing installation?
What are some of the biggest challenges that you are facing in rearchitecting the codebase?
What new capabilities will the FoundationDB storage layer enable?
What are some of the most interesting/unexpected/innovative ways that you have seen CouchDB used?
What new capabilities or use cases do you anticipate once this migration is complete?
What are some of the most interesting/unexpected/challenging lessons that you have learned while working with the CouchDB project and community?
What is in store for the future of CouchDB?
Contact Info
LinkedIn
@kocolosk on Twitter
kocolosk on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Apache CouchDB
FoundationDB
Podcast Episode
IBM
Cloudant
Experimental Particle Physics
FPGA == Field Programmable Gate Array
Apache Software Foundation
CRDT == Conflict-free Replicated Data Type
Podcast Episode
Erlang
Riak
RabbitMQ
Heisenbug
Kubernetes
Property Based Testing
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 9, 2020 • 54min
Scaling Data Governance For Global Businesses With A Data Hub Architecture
Summary
Data governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well-considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. By treating it as a graph problem, where each hub in the network has localized control and inherits higher-level controls, the approach reduces overhead and provides greater flexibility. Tim provides useful examples for understanding how to adopt this approach in your own organization, including some technology recommendations for making it maintainable and scalable. If you are struggling to scale data quality controls and governance requirements then this interview will provide some useful ideas to incorporate into your roadmap.
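The hub-with-inherited-controls idea described above can be sketched as a tree of hubs where each hub's effective policy merges its own rules over those of its parent. The hub names and policy keys below are invented for illustration:

```python
class DataHub:
    """Sketch of localized governance with inheritance between hubs."""

    def __init__(self, name, policy=None, parent=None):
        self.name = name
        self.local_policy = policy or {}
        self.parent = parent

    def effective_policy(self):
        # Walk up the tree; local rules override inherited ones.
        inherited = self.parent.effective_policy() if self.parent else {}
        return {**inherited, **self.local_policy}

global_hub = DataHub("global", {"pii_masking": True, "retention_days": 365})
eu_hub = DataHub("eu", {"retention_days": 30, "data_residency": "EU"},
                 parent=global_hub)

print(eu_hub.effective_policy())
```

This is what reduces the overhead mentioned in the summary: global rules are written once at the top, and each regional hub only declares the deltas its local regulations require.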
Announcements
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tim Ward about using an architectural pattern called data hub that allows for scaling data management across global businesses
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the goals of a data hub architecture?
What are the elements of a data hub architecture and how do they contribute to the overall goals?
What are some of the patterns or reference architectures that you drew on to develop this approach?
What are some signs that an organization should implement a data hub architecture?
What is the migration path for an organization that has an existing data platform but needs to scale its governance and localize storage and access?
What are the features or attributes of an individual hub that allow for them to be interconnected?
What is the interface presented between hubs to allow for accessing information across these localized repositories?
What is the process for adding a new hub and making it discoverable across the organization?
How is discoverability of data managed within and between hubs?
If someone wishes to access information between hubs or across several of them, how do you prevent data proliferation?
If data is copied between hubs, how are record updates accounted for to ensure that they are replicated to the hubs that hold a copy of that entity?
How are access controls and data masking managed to ensure that various compliance regimes are honored?
In addition to compliance issues, another challenge of distributed data repositories is the question of latency. How do you mitigate the performance impacts of querying across multiple hubs?
Given that different hubs can have differing rules for quality, cleanliness, or structure of a given record how do you handle transformations of data as it traverses different hubs?
How do you address issues of data loss or corruption within those transformations?
How is the topology of a hub infrastructure arranged and how does that impact questions of data loss through multiple zone transformations, latency, etc.?
How do you manage tracking and reporting of data lineage within and across hubs?
For an organization that is interested in implementing their own instance of a data hub architecture, what are the necessary components of an individual hub?
What are some of the considerations and useful technologies that would assist in creating and connecting hubs?
Should the hubs be implemented in a homogeneous fashion, or is there room for heterogeneity in their infrastructure as long as they expose the appropriate interface?
When is a data hub architecture the wrong approach?
Contact Info
LinkedIn
@jerrong on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
CluedIn
Podcast Episode
Eventual Connectivity Episode
Futurama
Kubernetes
Zookeeper
Podcast Episode
Data Governance
Data Lineage
Data Sovereignty
Graph Database
Helm Chart
Application Container
Docker Compose
LinkedIn DataHub
Udemy
PluralSight
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 2, 2020 • 44min
Easier Stream Processing On Kafka With ksqlDB
Summary
Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. In this episode Michael Drogalis, product manager for ksqlDB at Confluent, explains how the system is implemented, how you can use it for building your own stream processing applications, and how it fits into the lifecycle of your data infrastructure. If you have been struggling with building services on low level streaming interfaces then give this episode a listen and try it out for yourself.
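The summary above describes developers expressing stream processing with familiar SQL constructs. As a rough plain-Python illustration of the kind of continuously updated aggregate such a query maintains (this mimics the behavior of a grouped count over a stream; the event fields are invented, and ksqlDB's actual engine works very differently):

```python
from collections import Counter

def continuously_aggregate(stream):
    """Yield a snapshot of per-user counts after each arriving event,
    roughly what a grouped-count table materialized from a stream holds."""
    counts = Counter()
    for event in stream:
        counts[event["userid"]] += 1
        yield dict(counts)

events = [{"userid": "alice"}, {"userid": "bob"}, {"userid": "alice"}]
snapshots = list(continuously_aggregate(events))
print(snapshots[-1])
```

The difference in the real system is that the stream is unbounded and the state is kept durable and fault-tolerant by Kafka, which is exactly the plumbing ksqlDB is meant to hide behind the SQL.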
Announcements
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Michael Drogalis about ksqlDB, the open source streaming database layer for Kafka
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what ksqlDB is?
What are some of the use cases that it is designed for?
How do the capabilities and design of ksqlDB compare to other solutions for querying streaming data with SQL such as Pulsar SQL, PipelineDB, or Materialize?
What was the motivation for building a unified project for providing a database interface on the data stored in Kafka?
How is ksqlDB architected?
If you were to rebuild the entire platform and its components from scratch today, what would you do differently?
What is the workflow for an analyst or engineer to design and build an application on top of ksqlDB?
What dialect of SQL is supported?
What kinds of extensions or built in functions have been added to aid in the creation of streaming queries?
How are table schemas defined and enforced?
How do you handle schema migrations on active streams?
Typically a database is considered a long term storage location for data, whereas Kafka is a streaming layer with a bounded amount of durable storage. What is a typical lifecycle of information in ksqlDB?
Can you talk through an example architecture that might incorporate ksqlDB including the source systems, applications that might interact with the data in transit, and any destination systems for long term persistence?
What are some of the less obvious features of ksqlDB or capabilities that you think should be more widely publicized?
What are some of the edge cases or potential pitfalls that users should be aware of as they are designing their streaming applications?
What is involved in deploying and maintaining an installation of ksqlDB?
What are some of the operational characteristics of the system that should be considered while planning an installation such as scaling factors, high availability, or potential bottlenecks in the architecture?
When is ksqlDB the wrong choice?
What are some of the most interesting/unexpected/innovative projects that you have seen built with ksqlDB?
What are some of the most interesting/unexpected/challenging lessons that you have learned while working on ksqlDB?
What is in store for the future of the project?
Contact Info
@michaeldrogalis on Twitter
michaeldrogalis on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
ksqlDB
Confluent
Erlang
Onyx
Apache Storm
Stream Processing
Kafka
ksql
Kafka Streams
Pulsar
Podcast Episode
Pulsar SQL
PipelineDB
Podcast Episode
Materialize
Podcast Episode
Kafka Connect
RocksDB
Java Jar
CLI == Command Line Interface
PrestoDB
Podcast Episode
ANSI SQL
Pravega
Podcast Episode
Eventual Consistency
Confluent Cloud
MySQL
PostgreSQL
GraphQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 25, 2020 • 46min
Shining A Light on Shadow IT In Data And Analytics
Summary
Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Sean Knapp, Charlie Crocker about shadow IT in data and analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of shadow IT?
What are some of the reasons that members of an organization might start building their own solutions outside of what is supported by the engineering teams?
What are some of the roles in an organization that you have seen involved in these shadow IT projects?
What kinds of tools or platforms are well suited for being provisioned and managed without involvement from the platform team?
What are some of the pitfalls that these solutions present as a result of their initial ease of use?
What are the benefits to the organization of individuals or teams building and managing their own solutions?
What are some of the risks associated with these implementations of data collection, storage, management, or analysis that have no oversight from the teams typically tasked with managing those systems?
What are some of the ways that compliance or data quality issues can arise from these projects?
Once a project has been started outside of the approved channels it can quickly take on a life of its own. What are some of the ways you have identified the presence of "unauthorized" data projects?
Once you have identified the existence of such a project how can you revise their implementation to integrate them with the "approved" platform that the organization supports?
What are some strategies for removing the friction in the collection, access, or availability of data in an organization that can eliminate the need for shadow IT implementations?
What are some of the inherent complexities in data management which you would like to see resolved in order to reduce the tensions that lead to these bespoke solutions?
Contact Info
Sean
LinkedIn
@seanknapp on Twitter
Charlie
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Shadow IT
Ascend
Podcast Episode
ZoneHaven
Google Sawzall
M&A == Mergers and Acquisitions
DevOps
Waterfall Development
Data Governance
Data Lineage
Pioneers, Settlers, and Town Planners
PowerBI
Tableau
Excel
Amundsen
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 18, 2020 • 49min
Data Infrastructure Automation For Private SaaS At Snowplow
Summary
One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customer’s cloud accounts.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the components in your system architecture and the nature of your managed service?
What are some of the challenges that are inherent to private SaaS nature of your managed service?
What elements of your system require the most attention and maintenance to keep them running properly?
Which components in the pipeline are most subject to variability in traffic or resource pressure and what do you do to ensure proper capacity?
How do you manage deployment of the full Snowplow pipeline for your customers?
How has your strategy for deployment evolved since you first began offering the managed service?
How has the architecture of the pipeline evolved to simplify operations?
How much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?
What are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?
How does that reflect in the tooling that you use to manage their deployments?
What types of metrics do you track and what do you use for monitoring and alerting to ensure that your customers pipelines are running smoothly?
What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow?
What are some lessons that you can generalize for management of data infrastructure more broadly?
If you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently?
What do you have planned for the future of the Snowplow product and infrastructure management?
Contact Info
LinkedIn
jbeemster on GitHub
@jbeemster1 on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Snowplow Analytics
Podcast Episode
Terraform
Consul
Nomad
Meltdown Vulnerability
Spectre Vulnerability
AWS Kinesis
Elasticsearch
SnowflakeDB
Indicative
S3
Segment
AWS Cloudwatch
Stackdriver
Apache Kafka
Apache Pulsar
Google Cloud PubSub
AWS SQS
AWS SNS
AWS Redshift
Ansible
AWS Cloudformation
Kubernetes
AWS EMR
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 9, 2020 • 1h 6min
Data Modeling That Evolves With Your Business Using Data Vault
Summary
Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place without requiring destructive transformations and massive up front design to answer valuable questions. In this episode Kent Graziano shares his journey with data vault, explains how it allows for an agile approach to data warehousing, and explains the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow moving projects, or challenges integrating new data sources then listen in on this conversation and then give data vault a try for yourself.
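As a rough illustration of one Data Vault 2.0 convention mentioned in this style of modeling, entities are split into hubs (business keys), links (relationships), and satellites (descriptive attributes), with deterministic hash keys tying them together. The sketch below shows hypothetical hub and satellite rows; the key-derivation details vary between implementations.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic hash key from one or more business keys,
    a common Data Vault 2.0 convention: normalize, concatenate, hash."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# A hub row holds only the hash key, the business key, and load metadata.
hub_customer = {
    "customer_hk": hash_key("CUST-042"),
    "customer_bk": "CUST-042",
    "load_dts": datetime.now(timezone.utc),
    "record_source": "crm",
}

# A satellite row hangs descriptive attributes off the hub key; hashing
# the attributes ("hashdiff") makes change detection a single comparison.
attrs = {"name": "Ada Lovelace", "tier": "gold"}
sat_customer = {
    "customer_hk": hub_customer["customer_hk"],
    "hashdiff": hash_key(*attrs.values()),
    **attrs,
    "load_dts": datetime.now(timezone.utc),
}
```

Because keys are derived rather than sequence-generated, new sources can be loaded in parallel without coordinating surrogate keys, which is part of the flexibility discussed in the episode.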
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve Clickhouse, the open source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for Clickhouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Kent Graziano about data vault modeling and the role that it plays in the current data landscape
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what data vault modeling is and how it differs from other approaches such as third normal form or the star/snowflake schema?
What is the history of this approach and what limitations of alternate styles of modeling is it attempting to overcome?
How did you first encounter this approach to data modeling and what is your motivation for dedicating so much time and energy to promoting it?
What are some of the primary challenges associated with data modeling that contribute to the long lead times for data requests or outright project failure?
What are some of the foundational skills and knowledge that are necessary for effective modeling of data warehouses?
How has the era of data lakes, unstructured/semi-structured data, and non-relational storage engines impacted the state of the art in data modeling?
Is there any utility in data vault modeling in a data lake context (S3, Hadoop, etc.)?
What are the steps for establishing and evolving a data vault model in an organization?
How does that approach scale from one to many data sources and their varying lifecycles of schema changes and data loading?
What are some of the changes in query structure that consumers of the model will need to plan for?
Are there any performance or complexity impacts imposed by the data vault approach?
Can you talk through the overall lifecycle of data in a data vault modeled warehouse?
How does that compare to approaches such as audit/history tables in transaction databases or slowly changing dimensions in a star or snowflake model?
What are some cases where a data vault approach doesn’t fit the needs of an organization or application?
For listeners who want to learn more, what are some references or exercises that you recommend?
Contact Info
Website
LinkedIn
@KentGraziano on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SnowflakeDB
Data Vault Modeling
Data Warrior Blog
OLTP == On-Line Transaction Processing
Data Warehouse
Bill Inmon
Claudia Imhoff
Oracle DB
Third Normal Form
Star Schema
Snowflake Schema
Relational Theory
Sixth Normal Form
Denormalization
Pivot Table
Dan Linstedt
TDAN.com
Ralph Kimball
Agile Manifesto
Schema On Read
Data Lake
Hadoop
NoSQL
Data Vault Conference
Teradata
ODS (Operational Data Store) Model
Supercharge Your Data Warehouse (affiliate link)
Building A Scalable Data Warehouse With Data Vault 2.0 (affiliate link)
Data Model Resource Book (affiliate link)
Data Warehouse Toolkit (affiliate link)
Building The Data Warehouse (affiliate link)
Dan Linstedt Blog
Performance G2
Scale Free European Classes
Certus Australian Classes
Wherescape
Erwin
VaultSpeed
Data Vault Builder
Varigence BimlFlex
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast