

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Apr 14, 2020 • 26min
Making Data Collection In Your Code Easy With Rookout
Summary
The software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the data collection process from the lifecycle of your code. In this episode, CTO Liran Haimovitch discusses the benefits of shortening the iteration cycle and bringing non-engineers into the process of identifying useful data. This was a great conversation about the importance of democratizing the work of data collection.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Your host is Tobias Macey and today I’m interviewing Liran Haimovitch, CTO of Rookout, about the business value of operations metrics and other dark data in your organization
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the types of data that we typically collect in a systems operations context?
What are some of the business questions that can be answered from these data sources?
What are some of the considerations that developers and operations engineers need to be aware of when they are defining the collection points for system metrics and log messages?
What are some effective strategies that you have found for including business stakeholders in the process of defining these collection points?
One of the difficulties in building useful analyses from any source of data is maintaining the appropriate context. What are some of the necessary metadata that should be maintained along with operational metrics?
What are some of the shortcomings in the systems we design and use for operational data stores in terms of making the collected data useful for other purposes?
How does the existing tooling need to be changed or augmented to simplify the collaboration between engineers and stakeholders for defining and collecting the needed information?
The types of systems that we use for collecting and analyzing operations metrics are often designed and optimized for different access patterns and data formats than those used for analytical and exploratory purposes. What are your thoughts on how to incorporate the collected metrics with behavioral data?
What are some of the other sources of dark data that we should keep an eye out for in our organizations?
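One of the questions above concerns the metadata that should travel with operational metrics so that they remain useful for later business analysis. As a minimal sketch of what that can look like in practice (all field names here are invented for illustration, not Rookout's format):

```python
import json
import time

def emit_metric(name, value, **context):
    """Serialize a metric point with its contextual metadata attached."""
    point = {"metric": name, "value": value, "timestamp": time.time(), **context}
    return json.dumps(point)

# Attaching service, version, region, and deployment context at collection
# time is what lets a later analysis slice the metric by those dimensions.
line = emit_metric(
    "checkout.latency_ms", 182,
    service="checkout", version="1.4.2",
    region="us-east", deployment="canary",
)
print(line)
```

The point is that context recorded at the moment of collection is cheap, while reconstructing it after the fact is often impossible.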
Contact Info
LinkedIn
@Liran_Last on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Rookout
Cybersecurity
DevOps
DataDog
Graphite
Elasticsearch
Logz.io
Kafka
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Apr 7, 2020 • 45min
Building A Knowledge Graph Of Commercial Real Estate At Cherre
Summary
Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources of information. In this episode John Maiden talks about how Cherre builds knowledge graphs that provide powerful insights for their customers and the engineering challenges of building a scalable graph. If you’re wondering how to extract additional business value from existing data, this episode will provide a way to expand your data resources.
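The relationship-centric organization described above can be sketched as a minimal in-memory triple store. The entity and relation names below are invented for illustration; they are not Cherre's actual schema:

```python
from collections import defaultdict

class TripleStore:
    """Toy knowledge graph storing (subject, predicate, object) triples."""

    def __init__(self):
        self.by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        self.by_subject[subject].add((predicate, obj))

    def neighbors(self, subject, predicate=None):
        """Entities related to `subject`, optionally filtered by predicate."""
        return [o for (p, o) in self.by_subject[subject]
                if predicate is None or p == predicate]

graph = TripleStore()
graph.add("123 Main St", "owned_by", "Acme Holdings LLC")
graph.add("456 Oak Ave", "owned_by", "Acme Holdings LLC")
graph.add("Acme Holdings LLC", "registered_to", "Jane Doe")

# Two properties are connected through a shared owner -- the kind of
# multi-hop relationship that is awkward to surface from flat tables.
owner = graph.neighbors("123 Main St", "owned_by")[0]
print(graph.neighbors(owner))
```

Production graph databases like DGraph or Neo4J provide the same relational emphasis with indexing, query languages, and scale that a sketch like this omits.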
Announcements
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include ODSC East which has gone virtual starting April 16th. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing John Maiden about how Cherre is building and using a knowledge graph of commercial real estate information
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Cherre is and the role that data plays in the business?
What are the benefits of a knowledge graph for making real estate investment decisions?
What are the main ways that you and your customers are using the knowledge graph?
What are some of the challenges that you face in providing a usable interface for end-users to query the graph?
What technology are you using for storing and processing the graph?
What challenges do you face in scaling the complexity and analysis of the graph?
What are the main sources of data for the knowledge graph?
What are some of the ways that messiness manifests in the data that you are using to populate the graph?
How are you managing cleaning of the data and how do you identify and process records that can’t be coerced into the desired structure?
How do you handle missing attributes or extra attributes in a given record?
How did you approach the process of determining an effective taxonomy for records in the graph?
What is involved in performing entity extraction on your data?
What are some of the most interesting or unexpected questions that you have been able to ask and answer with the graph?
What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with this data?
What are some of the near and medium term improvements that you have planned for your knowledge graph?
What advice do you have for anyone who is interested in building a knowledge graph of their own?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Cherre
Commercial Real Estate
Knowledge Graph
RDF Triple
DGraph
Podcast Interview
Neo4J
TigerGraph
Google BigQuery
Apache Spark
Spark In Action Episode
Entity Extraction/Named Entity Recognition
NetworkX
Spark Graph Frames
Graph Embeddings
Airflow
Podcast.__init__ Interview
DBT
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 30, 2020 • 45min
The Life Of A Non-Profit Data Professional
Summary
Building and maintaining a system that integrates and analyzes all of the data for your organization is a complex endeavor. Operating on a shoe-string budget makes it even more challenging. In this episode Tyler Colby shares his experiences working as a data professional in the non-profit sector. From managing Salesforce data models to wrangling a multitude of data sources and compliance challenges, he describes the biggest challenges that he is facing.
Announcements
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include the Observe 20/20 virtual conference and ODSC East which has also gone virtual. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tyler Colby about his experiences working as a data professional in the non-profit arena, most recently at the Natural Resources Defense Council
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing your responsibilities as the director of data infrastructure at the NRDC?
What specific challenges are you facing at the NRDC?
Can you describe some of the types of data that you are working with at the NRDC?
What types of systems are you relying on for the source of your data?
What kinds of systems have you put in place to manage the data needs of the NRDC?
What are your biggest influences in the build vs. buy decisions that you make?
What heuristics or guidelines do you rely on for aligning your work with the business value that it will produce and the broader mission of the organization?
Have you found there to be any extra scrutiny of your work as a member of a non-profit in terms of regulations or compliance questions?
Your career has involved a significant focus on the Salesforce platform. For anyone not familiar with it, what benefits does it provide in managing information flows and analysis capabilities?
What are some of the most challenging or complex aspects of working with Salesforce?
In light of the current global crisis posed by COVID-19 you have established a new non-profit entity to organize the efforts of various technical professionals. Can you describe the nature of that mission?
What are some of the unique data challenges that you anticipate or have already encountered?
How do the data challenges of this new organization compare to your past experiences?
What have you found to be most useful or beneficial in the current landscape of data management systems and practices in your career with non-profit organizations?
What are the areas that need to be addressed or improved for workers in the non-profit sector?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
NRDC
AWS Redshift
Time Warner Cable
Salesforce
Cloud For Good
Tableau
Civis Analytics
EveryAction
BlackBaud
ActionKit
MobileCommons
XKCD 1667
GDPR == General Data Protection Regulation
CCPA == California Consumer Privacy Act
Salesforce Apex
Salesforce.org
Salesforce Non-Profit Success Pack
Validity
OpenRefine
JitterBit
Skyvia
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 23, 2020 • 36min
Behind The Scenes Of The Linode Object Storage Service
Summary
There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the journey that he and his team at Linode recently completed to bring a fast and reliable S3-compatible object storage service to production for your benefit. He discusses the challenges of running object storage for public usage, some of the interesting ways that it was stress tested internally, and the lessons that he learned along the way.
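The S3 API that the episode treats as a de facto standard boils down to a small set of verbs. A toy in-memory version (an illustration of the API surface only, not Linode's Ceph-backed implementation) makes that shape clear:

```python
class Bucket:
    """In-memory sketch of the core S3-style object storage operations."""

    def __init__(self, name):
        self.name = name
        self._objects = {}

    def put_object(self, key, body: bytes):
        self._objects[key] = body

    def get_object(self, key) -> bytes:
        return self._objects[key]

    def list_objects(self, prefix=""):
        # S3 buckets are flat key spaces; "directories" are just key prefixes.
        return sorted(k for k in self._objects if k.startswith(prefix))

    def delete_object(self, key):
        self._objects.pop(key, None)

bucket = Bucket("demo")
bucket.put_object("logs/2020-03-23.txt", b"hello")
bucket.put_object("logs/2020-03-24.txt", b"world")
print(bucket.list_objects(prefix="logs/"))
```

Because so many clients speak only these verbs, any provider that implements them, whether backed by Ceph, MinIO, or custom storage, can slot in behind existing tooling.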
Announcements
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Will Smith about his work on building object storage for the Linode cloud platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the current state of your object storage product?
What was the motivating factor for building and managing your own object storage system rather than building an integration with another offering such as Wasabi or Backblaze?
What is the scale and scope of usage that you had to design for?
Can you describe how your platform is implemented?
What was your criteria for deciding whether to use an available platform such as Ceph or MinIO vs building your own from scratch?
How have your initial assumptions about the operability and maintainability of your installation been challenged or updated since it has been released to the public?
What have been the biggest challenges that you have faced in designing and deploying a system that can meet the scale and reliability requirements of Linode?
What are the most important capabilities for the underlying hardware that you are running on?
What supporting systems and tools are you using to manage the availability and durability of your object storage?
How did you approach the rollout of Linode’s object storage to gain the confidence that you needed to feel comfortable with full scale usage?
What are some of the benefits that you have gained internally at Linode from having an object storage system available to your product teams?
What are your thoughts on the state of the S3 API as a de facto standard for object storage?
What is your main focus now that object storage is being rolled out to more data centers?
Contact Info
Dorthu on GitHub
dorthu22 on Twitter
LinkedIn
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Linode Object Storage
Xen Hypervisor
KVM (Linux Kernel Virtual Machine)
Linode API V4
Ceph Distributed Filesystem
Podcast Episode
Wasabi
Backblaze
MinIO
CERN Ceph Scaling Paper
RADOS Gateway
OpenResty
Lua
Prometheus
Linode Managed Kubernetes
Ceph Swift Protocol
Ceph Bug Tracker
Linode Dashboard Application Source Code
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 17, 2020 • 55min
Building A New Foundation For CouchDB
Summary
CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP interface it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt, which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission-critical project and the work being done to evolve it.
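The synchronization protocol mentioned above builds on CouchDB's document-revision model: every write must present the current revision token or it is rejected as a conflict. The sketch below is a heavy simplification (real CouchDB revisions are generation-hash pairs and conflicts are tracked for replication rather than simply refused), but it shows the core idea:

```python
import uuid

class ConflictError(Exception):
    pass

class DocStore:
    """Toy sketch of CouchDB-style optimistic concurrency control."""

    def __init__(self):
        self._docs = {}  # doc_id -> (rev, body)

    def put(self, doc_id, body, rev=None):
        current = self._docs.get(doc_id)
        # A write against anything but the latest revision is a conflict.
        if current is not None and current[0] != rev:
            raise ConflictError(f"revision mismatch for {doc_id}")
        generation = 1 if current is None else int(current[0].split("-")[0]) + 1
        new_rev = f"{generation}-{uuid.uuid4().hex[:8]}"
        self._docs[doc_id] = (new_rev, body)
        return new_rev

    def get(self, doc_id):
        return self._docs[doc_id]

store = DocStore()
rev1 = store.put("user:1", {"name": "Ada"})
rev2 = store.put("user:1", {"name": "Ada Lovelace"}, rev=rev1)
print(rev1, "->", rev2)
```

It is this revision bookkeeping that lets two disconnected replicas each accept writes and later reconcile them, which is why the choice of storage layer underneath it matters so much.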
Announcements
Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Adam Kocoloski about CouchDB and the work being done to migrate the storage layer to FoundationDB
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what CouchDB is?
How did you get involved in the CouchDB project and what is your current role in the community?
What are the use cases that it is well suited for?
Can you share some of the history of CouchDB and its role in the NoSQL movement?
How is CouchDB currently architected and how has it evolved since it was first introduced?
What have been the benefits and challenges of Erlang as the runtime for CouchDB?
How is the current storage engine implemented and what are its shortcomings?
What problems are you trying to solve by replatforming on a new storage layer?
What were the selection criteria for the new storage engine and how did you structure the decision making process?
What was the motivation for choosing FoundationDB as opposed to other options such as rocksDB, levelDB, etc.?
How is the adoption of FoundationDB going to impact the overall architecture and implementation of CouchDB?
How will the use of FoundationDB impact the way that the current capabilities are implemented, such as data replication?
What will the migration path be for people running an existing installation?
What are some of the biggest challenges that you are facing in rearchitecting the codebase?
What new capabilities will the FoundationDB storage layer enable?
What are some of the most interesting/unexpected/innovative ways that you have seen CouchDB used?
What new capabilities or use cases do you anticipate once this migration is complete?
What are some of the most interesting/unexpected/challenging lessons that you have learned while working with the CouchDB project and community?
What is in store for the future of CouchDB?
Contact Info
LinkedIn
@kocolosk on Twitter
kocolosk on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Apache CouchDB
FoundationDB
Podcast Episode
IBM
Cloudant
Experimental Particle Physics
FPGA == Field Programmable Gate Array
Apache Software Foundation
CRDT == Conflict-free Replicated Data Type
Podcast Episode
Erlang
Riak
RabbitMQ
Heisenbug
Kubernetes
Property Based Testing
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 9, 2020 • 54min
Scaling Data Governance For Global Businesses With A Data Hub Architecture
Summary
Data governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well-considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. By treating it as a graph problem, where each hub in the network has localized control and inherits higher-level controls, the approach reduces overhead and provides greater flexibility. Tim provides useful examples for understanding how to adopt this approach in your own organization, including some technology recommendations for making it maintainable and scalable. If you are struggling to scale data quality controls and governance requirements then this interview will provide some useful ideas to incorporate into your roadmap.
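The hub-with-inherited-controls idea described above can be sketched as a tree of hubs where each hub's effective policy merges its own rules over those of its parent. The hub names and policy keys below are invented for illustration:

```python
class DataHub:
    """Sketch of localized governance with inheritance between hubs."""

    def __init__(self, name, policy=None, parent=None):
        self.name = name
        self.local_policy = policy or {}
        self.parent = parent

    def effective_policy(self):
        # Walk up the tree; local rules override inherited ones.
        inherited = self.parent.effective_policy() if self.parent else {}
        return {**inherited, **self.local_policy}

global_hub = DataHub("global", {"pii_masking": True, "retention_days": 365})
eu_hub = DataHub("eu", {"retention_days": 30, "data_residency": "EU"},
                 parent=global_hub)

print(eu_hub.effective_policy())
```

This is what reduces the overhead mentioned in the summary: global rules are written once at the top, and each regional hub only declares the deltas its local regulations require.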
Announcements
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tim Ward about using an architectural pattern called data hub that allows for scaling data management across global businesses
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the goals of a data hub architecture?
What are the elements of a data hub architecture and how do they contribute to the overall goals?
What are some of the patterns or reference architectures that you drew on to develop this approach?
What are some signs that an organization should implement a data hub architecture?
What is the migration path for an organization that has an existing data platform but needs to scale its governance and localize storage and access?
What are the features or attributes of an individual hub that allow for them to be interconnected?
What is the interface presented between hubs to allow for accessing information across these localized repositories?
What is the process for adding a new hub and making it discoverable across the organization?
How is discoverability of data managed within and between hubs?
If someone wishes to access information between hubs or across several of them, how do you prevent data proliferation?
If data is copied between hubs, how are record updates accounted for to ensure that they are replicated to the hubs that hold a copy of that entity?
How are access controls and data masking managed to ensure that various compliance regimes are honored?
In addition to compliance issues, another challenge of distributed data repositories is the question of latency. How do you mitigate the performance impacts of querying across multiple hubs?
Given that different hubs can have differing rules for quality, cleanliness, or structure of a given record how do you handle transformations of data as it traverses different hubs?
How do you address issues of data loss or corruption within those transformations?
How is the topology of a hub infrastructure arranged and how does that impact questions of data loss through multiple zone transformations, latency, etc.?
How do you manage tracking and reporting of data lineage within and across hubs?
For an organization that is interested in implementing their own instance of a data hub architecture, what are the necessary components of an individual hub?
What are some of the considerations and useful technologies that would assist in creating and connecting hubs?
Should the hubs be implemented in a homogeneous fashion, or is there room for heterogeneity in their infrastructure as long as they expose the appropriate interface?
When is a data hub architecture the wrong approach?
Contact Info
LinkedIn
@jerrong on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
CluedIn
Podcast Episode
Eventual Connectivity Episode
Futurama
Kubernetes
Zookeeper
Podcast Episode
Data Governance
Data Lineage
Data Sovereignty
Graph Database
Helm Chart
Application Container
Docker Compose
LinkedIn DataHub
Udemy
PluralSight
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Mar 2, 2020 • 44min
Easier Stream Processing On Kafka With ksqlDB
Summary
Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. In this episode Michael Drogalis, product manager for ksqlDB at Confluent, explains how the system is implemented, how you can use it for building your own stream processing applications, and how it fits into the lifecycle of your data infrastructure. If you have been struggling with building services on low level streaming interfaces then give this episode a listen and try it out for yourself.
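The summary above describes developers expressing stream processing with familiar SQL constructs. As a rough plain-Python illustration of the kind of continuously updated aggregate such a query maintains (this mimics the behavior of a grouped count over a stream; the event fields are invented, and ksqlDB's actual engine works very differently):

```python
from collections import Counter

def continuously_aggregate(stream):
    """Yield a snapshot of per-user counts after each arriving event,
    roughly what a grouped-count table materialized from a stream holds."""
    counts = Counter()
    for event in stream:
        counts[event["userid"]] += 1
        yield dict(counts)

events = [{"userid": "alice"}, {"userid": "bob"}, {"userid": "alice"}]
snapshots = list(continuously_aggregate(events))
print(snapshots[-1])
```

The difference in the real system is that the stream is unbounded and the state is kept durable and fault-tolerant by Kafka, which is exactly the plumbing ksqlDB is meant to hide behind the SQL.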
Announcements
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Michael Drogalis about ksqlDB, the open source streaming database layer for Kafka
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what ksqlDB is?
What are some of the use cases that it is designed for?
How do the capabilities and design of ksqlDB compare to other solutions for querying streaming data with SQL such as Pulsar SQL, PipelineDB, or Materialize?
What was the motivation for building a unified project for providing a database interface on the data stored in Kafka?
How is ksqlDB architected?
If you were to rebuild the entire platform and its components from scratch today, what would you do differently?
What is the workflow for an analyst or engineer to design and build an application on top of ksqlDB?
What dialect of SQL is supported?
What kinds of extensions or built in functions have been added to aid in the creation of streaming queries?
How are table schemas defined and enforced?
How do you handle schema migrations on active streams?
Typically a database is considered a long term storage location for data, whereas Kafka is a streaming layer with a bounded amount of durable storage. What is a typical lifecycle of information in ksqlDB?
Can you talk through an example architecture that might incorporate ksqlDB including the source systems, applications that might interact with the data in transit, and any destination systems for long term persistence?
What are some of the less obvious features of ksqlDB or capabilities that you think should be more widely publicized?
What are some of the edge cases or potential pitfalls that users should be aware of as they are designing their streaming applications?
What is involved in deploying and maintaining an installation of ksqlDB?
What are some of the operational characteristics of the system that should be considered while planning an installation such as scaling factors, high availability, or potential bottlenecks in the architecture?
When is ksqlDB the wrong choice?
What are some of the most interesting/unexpected/innovative projects that you have seen built with ksqlDB?
What are some of the most interesting/unexpected/challenging lessons that you have learned while working on ksqlDB?
What is in store for the future of the project?
Contact Info
@michaeldrogalis on Twitter
michaeldrogalis on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
ksqlDB
Confluent
Erlang
Onyx
Apache Storm
Stream Processing
Kafka
ksql
Kafka Streams
Pulsar
Podcast Episode
Pulsar SQL
PipelineDB
Podcast Episode
Materialize
Podcast Episode
Kafka Connect
RocksDB
Java Jar
CLI == Command Line Interface
PrestoDB
Podcast Episode
ANSI SQL
Pravega
Podcast Episode
Eventual Consistency
Confluent Cloud
MySQL
PostgreSQL
GraphQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 25, 2020 • 46min
Shining A Light on Shadow IT In Data And Analytics
Summary
Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Sean Knapp, Charlie Crocker about shadow IT in data and analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of shadow IT?
What are some of the reasons that members of an organization might start building their own solutions outside of what is supported by the engineering teams?
What are some of the roles in an organization that you have seen involved in these shadow IT projects?
What kinds of tools or platforms are well suited for being provisioned and managed without involvement from the platform team?
What are some of the pitfalls that these solutions present as a result of their initial ease of use?
What are the benefits to the organization of individuals or teams building and managing their own solutions?
What are some of the risks associated with these implementations of data collection, storage, management, or analysis that have no oversight from the teams typically tasked with managing those systems?
What are some of the ways that compliance or data quality issues can arise from these projects?
Once a project has been started outside of the approved channels it can quickly take on a life of its own. What are some of the ways you have identified the presence of "unauthorized" data projects?
Once you have identified the existence of such a project how can you revise their implementation to integrate them with the "approved" platform that the organization supports?
What are some strategies for removing the friction in the collection, access, or availability of data in an organization that can eliminate the need for shadow IT implementations?
What are some of the inherent complexities in data management which you would like to see resolved in order to reduce the tensions that lead to these bespoke solutions?
Contact Info
Sean
LinkedIn
@seanknapp on Twitter
Charlie
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Shadow IT
Ascend
Podcast Episode
ZoneHaven
Google Sawzall
M&A == Mergers and Acquisitions
DevOps
Waterfall Development
Data Governance
Data Lineage
Pioneers, Settlers, and Town Planners
PowerBI
Tableau
Excel
Amundsen
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 18, 2020 • 49min
Data Infrastructure Automation For Private SaaS At Snowplow
Summary
One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customer’s cloud accounts.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the components in your system architecture and the nature of your managed service?
What are some of the challenges that are inherent to private SaaS nature of your managed service?
What elements of your system require the most attention and maintenance to keep them running properly?
Which components in the pipeline are most subject to variability in traffic or resource pressure and what do you do to ensure proper capacity?
How do you manage deployment of the full Snowplow pipeline for your customers?
How has your strategy for deployment evolved since you first began offering the managed service?
How has the architecture of the pipeline evolved to simplify operations?
How much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?
What are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?
How does that reflect in the tooling that you use to manage their deployments?
What types of metrics do you track and what do you use for monitoring and alerting to ensure that your customers pipelines are running smoothly?
What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow?
What are some lessons that you can generalize for management of data infrastructure more broadly?
If you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently?
What do you have planned for the future of the Snowplow product and infrastructure management?
Contact Info
LinkedIn
jbeemster on GitHub
@jbeemster1 on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Snowplow Analytics
Podcast Episode
Terraform
Consul
Nomad
Meltdown Vulnerability
Spectre Vulnerability
AWS Kinesis
Elasticsearch
SnowflakeDB
Indicative
S3
Segment
AWS Cloudwatch
Stackdriver
Apache Kafka
Apache Pulsar
Google Cloud PubSub
AWS SQS
AWS SNS
AWS Redshift
Ansible
AWS Cloudformation
Kubernetes
AWS EMR
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Feb 9, 2020 • 1h 6min
Data Modeling That Evolves With Your Business Using Data Vault
Summary
Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place without requiring destructive transformations and massive up front design to answer valuable questions. In this episode Kent Graziano shares his journey with data vault, explains how it allows for an agile approach to data warehousing, and explains the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow moving projects, or challenges integrating new data sources then listen in on this conversation and then give data vault a try for yourself.
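As a rough illustration of one Data Vault 2.0 convention mentioned in this style of modeling, entities are split into hubs (business keys), links (relationships), and satellites (descriptive attributes), with deterministic hash keys tying them together. The sketch below shows hypothetical hub and satellite rows; the key-derivation details vary between implementations.

```python
import hashlib
from datetime import datetime, timezone

def hash_key(*business_keys: str) -> str:
    """Derive a deterministic hash key from one or more business keys,
    a common Data Vault 2.0 convention: normalize, concatenate, hash."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# A hub row holds only the hash key, the business key, and load metadata.
hub_customer = {
    "customer_hk": hash_key("CUST-042"),
    "customer_bk": "CUST-042",
    "load_dts": datetime.now(timezone.utc),
    "record_source": "crm",
}

# A satellite row hangs descriptive attributes off the hub key; hashing
# the attributes ("hashdiff") makes change detection a single comparison.
attrs = {"name": "Ada Lovelace", "tier": "gold"}
sat_customer = {
    "customer_hk": hub_customer["customer_hk"],
    "hashdiff": hash_key(*attrs.values()),
    **attrs,
    "load_dts": datetime.now(timezone.utc),
}
```

Because keys are derived rather than sequence-generated, new sources can be loaded in parallel without coordinating surrogate keys, which is part of the flexibility discussed in the episode.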
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve Clickhouse, the open source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for Clickhouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Kent Graziano about data vault modeling and the role that it plays in the current data landscape
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what data vault modeling is and how it differs from other approaches such as third normal form or the star/snowflake schema?
What is the history of this approach and what limitations of alternate styles of modeling is it attempting to overcome?
How did you first encounter this approach to data modeling and what is your motivation for dedicating so much time and energy to promoting it?
What are some of the primary challenges associated with data modeling that contribute to the long lead times for data requests or outright project failure?
What are some of the foundational skills and knowledge that are necessary for effective modeling of data warehouses?
How has the era of data lakes, unstructured/semi-structured data, and non-relational storage engines impacted the state of the art in data modeling?
Is there any utility in data vault modeling in a data lake context (S3, Hadoop, etc.)?
What are the steps for establishing and evolving a data vault model in an organization?
How does that approach scale from one to many data sources and their varying lifecycles of schema changes and data loading?
What are some of the changes in query structure that consumers of the model will need to plan for?
Are there any performance or complexity impacts imposed by the data vault approach?
Can you talk through the overall lifecycle of data in a data vault modeled warehouse?
How does that compare to approaches such as audit/history tables in transaction databases or slowly changing dimensions in a star or snowflake model?
What are some cases where a data vault approach doesn’t fit the needs of an organization or application?
For listeners who want to learn more, what are some references or exercises that you recommend?
Contact Info
Website
LinkedIn
@KentGraziano on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SnowflakeDB
Data Vault Modeling
Data Warrior Blog
OLTP == On-Line Transaction Processing
Data Warehouse
Bill Inmon
Claudia Imhoff
Oracle DB
Third Normal Form
Star Schema
Snowflake Schema
Relational Theory
Sixth Normal Form
Denormalization
Pivot Table
Dan Linstedt
TDAN.com
Ralph Kimball
Agile Manifesto
Schema On Read
Data Lake
Hadoop
NoSQL
Data Vault Conference
Teradata
ODS (Operational Data Store) Model
Supercharge Your Data Warehouse (affiliate link)
Building A Scalable Data Warehouse With Data Vault 2.0 (affiliate link)
Data Model Resource Book (affiliate link)
Data Warehouse Toolkit (affiliate link)
Building The Data Warehouse (affiliate link)
Dan Linstedt Blog
Performance G2
Scale Free European Classes
Certus Australian Classes
Wherescape
Erwin
VaultSpeed
Data Vault Builder
Varigence BimlFlex
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast