

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes for the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes

Oct 22, 2018 • 46min
Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53
Summary
As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the conversation around ethics within and between data teams. It is a Python project that generates a checklist of common concerns for data-oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
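For a concrete starting point, here is a minimal, hedged sketch of generating the default checklist from Python; it assumes the deon package is installed (pip install deon) and that the -o/--output flag behaves as documented, with ETHICS.md as an arbitrary output name.

```python
# Minimal sketch: generating Deon's default ethics checklist.
# Assumes `pip install deon`; the output path ETHICS.md is an
# arbitrary choice for this example.
import subprocess

# Writes the checklist as markdown into the project root so the team
# can review and check items off alongside the code.
subprocess.run(["deon", "--output", "ETHICS.md"], check=True)
```

The tool also accepts a custom checklist definition, which is where the team-specific addendums discussed in the interview would live.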
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
This is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle of data-oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen, and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com
Interview
Introductions
How did you get introduced to Python?
Can you start by describing what Deon is and your motivation for creating it?
Why a checklist, specifically? What’s the advantage of this over an oath, for example?
What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
What is the typical workflow for a team that is using Deon in their projects?
Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?
Have you received pushback on any of the default items?
How does Deon simplify communication around ethics across team boundaries?
What are some of the most often overlooked items?
What are some of the most difficult ethical concerns to comply with for a typical data science project?
How has Deon helped you at Driven Data?
What are the customer facing impacts of embedding a discussion of ethics in the product development process?
Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
What are your hopes for the future of the Deon project?
Keep In Touch
Emily
LinkedIn
ejm714 on GitHub
Peter
LinkedIn
@pjbull on Twitter
pjbull on GitHub
Driven Data
@drivendataorg on Twitter
drivendataorg on GitHub
Website
Picks
Tobias
Richard Bond Glass Art
Emily
Tandem Coffee in Portland, Maine
Peter
The Model Bakery in Saint Helena and Napa, California
Links
Deon
Driven Data
International Development
Brookings Institution
Stata
Econometrics
Metis Bootcamp
Pandas
Podcast Episode
C#
.NET
Podcast.__init__ Episode On Software Ethics
Jupyter Notebook
Podcast Episode
Word2Vec
cookiecutter data science
Logistic Regression
The intro and outro music is from Requiem for a Fish by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 15, 2018 • 54min
Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52
Summary
With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built on the assumptions of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.
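As a hedged sketch of what consuming an Iceberg table can look like from PySpark: the exact format name and catalog configuration depend on the connector version you deploy, and the table name here is hypothetical.

```python
# Hedged sketch: reading an Iceberg table from PySpark. Assumes a
# SparkSession with the Iceberg runtime available and a table named
# "db.events" registered in the configured catalog.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-example").getOrCreate()

# The metadata layer resolves the current snapshot and prunes data
# files before any reads happen, which is where the performance and
# correctness gains discussed in this episode come from.
df = spark.read.format("iceberg").load("db.events")
df.filter(df.event_date == "2018-10-15").show()
```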
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Iceberg is and the motivation for creating it?
Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?
How has the use of Iceberg simplified your work at Netflix?
How is the reference implementation architected and how has it evolved since you first began work on it?
What is involved in deploying it to a user’s environment?
For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?
Is there a migration path for pre-existing tables into the Iceberg format?
How is schema evolution managed at the file level?
How do you handle files on disk that don’t contain all of the fields specified in a table definition?
One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard?
What are the unique challenges posed by using S3 as the basis for a data lake?
What are the benefits that outweigh the difficulties?
What have been some of the most challenging or contentious details of the specification to define?
What are some things that you have explicitly left out of the specification?
What are your long-term goals for the Iceberg specification?
Do you anticipate the reference implementation continuing to be used and maintained?
Contact Info
rdblue on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Iceberg Reference Implementation
Iceberg Table Specification
Netflix
Hadoop
Cloudera
Avro
Parquet
Spark
S3
HDFS
Hive
ORC
S3mper
Git
Metacat
Presto
Pig
DDL (Data Definition Language)
Cost-Based Optimization
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Oct 9, 2018 • 57min
Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51
Summary
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a NewSQL database built for simultaneous transactional and analytic workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what MemSQL is and how the product and business first got started?
What are the typical use cases for customers running MemSQL?
What are the benefits of integrating the ingestion pipeline with the database engine?
What are some typical ways that the ingest capability is leveraged by customers?
How is MemSQL architected and how has the internal design evolved from when you first started working on it?
Where does it fall on the axes of the CAP theorem?
How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?
Can you describe the lifecycle of a write transaction?
Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?
How do you mitigate the impact of network latency throughout the cluster during query planning and execution?
How much of the implementation of MemSQL is using custom built code vs. open source projects?
What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?
What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?
When is MemSQL the wrong choice for a data platform?
What do you have planned for the future of MemSQL?
Contact Info
@nikitashamgunov on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
MemSQL
NewSQL
Microsoft SQL Server
St. Petersburg University of Fine Mechanics And Optics
C
C++
In-Memory Database
RAM (Random Access Memory)
Flash Storage
Oracle DB
PostgreSQL
Podcast Episode
Kafka
Kinesis
Wealth Management
Data Warehouse
ODBC
S3
HDFS
Avro
Parquet
Data Serialization
Podcast Episode
Broadcast Join
Shuffle Join
CAP Theorem
Apache Arrow
LZ4
S2 Geospatial Library
Sybase
SAP Hana
Kubernetes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
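To ground the transactional-plus-analytical discussion above: MemSQL speaks the MySQL wire protocol, so a standard client can issue both kinds of queries against the same cluster. A hedged sketch, with host, credentials, and schema invented for illustration:

```python
# Illustrative sketch: transactional and analytical queries against the
# same MemSQL cluster over the MySQL wire protocol. The host,
# credentials, and table are hypothetical.
import pymysql

conn = pymysql.connect(host="memsql.example.com", user="app",
                       password="secret", database="shop")
try:
    with conn.cursor() as cur:
        # Transactional path: a single-row insert from the application.
        cur.execute(
            "INSERT INTO orders (customer_id, amount) VALUES (%s, %s)",
            (342, 19.99),
        )
        conn.commit()
        # Analytical path: an aggregate over the same table, with no
        # ETL hop into a separate warehouse.
        cur.execute(
            "SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id"
        )
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```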

Oct 1, 2018 • 53min
Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50
Summary
There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph from public data for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing in scaling the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.
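One pattern that comes up in this conversation (see Papermill in the links below) is parameterized notebook execution for repeatable ETL runs; here is a hedged sketch of that pattern, with notebook paths and parameters invented for illustration.

```python
# Hedged sketch of parameterized notebook execution with Papermill,
# one of the tools listed in the episode links. Notebook paths and
# parameters here are hypothetical.
import papermill as pm

# Run the same ETL notebook against a specific source and date, writing
# an executed copy that serves as a reviewable artifact of the job.
pm.execute_notebook(
    "etl_template.ipynb",
    "runs/etl_2018-10-01.ipynb",
    parameters={"source": "public_records", "run_date": "2018-10-01"},
)
```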
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph
Interview
Introduction
How did you get involved in the area of data management?
Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?
How do you define the concept of a knowledge graph?
What are the processes involved in constructing a knowledge graph?
Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?
What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?
How do you manage the software lifecycle for your ETL code?
What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?
What are the current challenges that you are facing in building and scaling your data infrastructure?
How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?
What techniques are you using to manage accuracy and consistency in the data that you ingest?
Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?
What are the weak spots in your platform that you are planning to address in upcoming projects?
If you were to start from scratch today, what would you have done differently?
What are some of the most interesting or unexpected uses of your product that you have seen?
What is in store for the future of Enigma?
Contact Info
Email
Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Enigma
Chicago Tribune
NPR
Quartz
CSVKit
Agate
Knowledge Graph
Taxonomy
Concourse
Airflow
Docker
S3
Data Lake
Parquet
Podcast Episode
Spark
AWS Neptune
AWS Batch
Money Laundering
Jupyter Notebook
Papermill
Jupytext
Cauldron: The Un-Notebook
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 24, 2018 • 50min
A Primer On Enterprise Data Curation with Todd Walter - Episode 49
Summary
As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.
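The layered lifecycle described above lends itself to a small illustration; this is a toy sketch of promoting data from a raw lake zone through staging to a curated warehouse table, with paths and schema entirely hypothetical.

```python
# Illustrative sketch of the layered lifecycle described here: raw
# records land in a data lake zone, get cleaned into a staging zone,
# and only conformed tables are promoted for warehouse consumers.
# Paths and schema are hypothetical.
import pandas as pd

# Raw zone: loosely structured records as they arrived.
raw = pd.read_json("lake/raw/events-2018-09-24.json", lines=True)

# Staging zone: enforce types and drop malformed rows.
staged = raw.dropna(subset=["customer_id"]).astype({"customer_id": "int64"})
staged.to_parquet("lake/staged/events-2018-09-24.parquet")

# Curated zone: an aggregate shaped for the business to consume.
curated = staged.groupby("customer_id").size().rename("event_count").reset_index()
curated.to_parquet("warehouse/curated/customer_events.parquet")
```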
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence
Interview
Introduction
How did you get involved in the area of data management?
How do you define data curation?
What are some of the high level concerns that are encapsulated in that effort?
How does the size and maturity of a company affect the ways that they architect and interact with their data systems?
Can you walk through the stages of an ideal lifecycle for data within the context of an organization’s uses for it?
What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure?
What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space?
As “big data” became more widely discussed, the common mantra was to store everything, because you never know when you’ll need the data that might otherwise get thrown away. As the industry reaches a greater degree of maturity and more regulations are implemented, there has been a shift toward being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep?
In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?
What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?
Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure?
ETL has long been the default approach for building and enforcing data architecture, but there have been significant shifts in recent years due to the emergence of streaming systems and ELT approaches in new data warehouses. What are your thoughts on the landscape for managing data flows and migration and when to use which approach?
What are some of the areas of data architecture and curation that are most often forgotten or ignored?
What resources do you recommend for anyone who is interested in learning more about the landscape of data architecture and curation?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Teradata
Data Architecture
Data Curation
Data Warehouse
Chief Data Officer
ETL (Extract, Transform, Load)
Data Lake
Metadata
Data Lineage
Data Provenance
Strata Conference
ELT (Extract, Load, Transform)
Map-Reduce
Hive
Pig
Spark
Data Governance
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 17, 2018 • 48min
Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48
Summary
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.
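For a sense of the collection side of the stack, here is a hedged sketch using the Snowplow Python tracker; the collector endpoint and identifiers are hypothetical, and the tracker API may differ between versions.

```python
# Hedged sketch: sending a page view event to a Snowplow collector
# using the Python tracker. The collector hostname, namespace, and
# app_id are hypothetical.
from snowplow_tracker import Emitter, Tracker

# Events are buffered by the emitter and sent to the collector, the
# first stage of the pipeline discussed in this episode.
emitter = Emitter("collector.example.com")
tracker = Tracker(emitter, namespace="web", app_id="example-shop")
tracker.track_page_view("https://example.com/checkout", "Checkout")
```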
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics
Interview
Introductions
How did you get involved in the area of data engineering and data management?
What is Snowplow Analytics and what problem were you trying to solve when you started the company?
What is unique about customer event data from an ingestion and processing perspective?
Challenges with properly matching up data between sources
Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?
Cleanliness/accuracy
What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?
Can you describe the overall architecture of the ingest pipeline that Snowplow provides?
How has that architecture evolved from when you first started?
What would you do differently if you were to start over today?
Ensuring appropriate use of enrichment sources
What have been some of the biggest challenges encountered while building and evolving Snowplow?
What are some of the most interesting uses of your platform that you are aware of?
Keep In Touch
Alex
@alexcrdean on Twitter
LinkedIn
Snowplow
@snowplowdata on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Snowplow
GitHub
Deloitte Consulting
OpenX
Hadoop
AWS
EMR (Elastic Map-Reduce)
Business Intelligence
Data Warehousing
Google Analytics
CRM (Customer Relationship Management)
S3
GDPR (General Data Protection Regulation)
Kinesis
Kafka
Google Cloud Pub-Sub
JSON-Schema
Iglu
IAB Bots And Spiders List
Heap Analytics
Podcast Interview
Redshift
SnowflakeDB
Snowplow Insights
Google Cloud Platform
Azure
GitLab
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 10, 2018 • 48min
Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47
Summary
Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data in S3 and still make it usable, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for your serverless data analysis.
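Because the platform exposes an Elasticsearch-compatible API over data indexed in S3, a standard client should be able to query it; the following sketch uses the elasticsearch Python client with a hypothetical endpoint and index.

```python
# Hedged sketch: querying log history through an Elasticsearch-
# compatible endpoint such as the one Chaos Search exposes. The
# endpoint URL and index name are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://search.example-tenant.chaossearch.io"])

# A time-bounded query over historical logs that would normally have
# aged out of a hot Elasticsearch cluster.
result = es.search(index="app-logs", body={
    "query": {"range": {"@timestamp": {"gte": "now-90d"}}},
    "size": 10,
})
for hit in result["hits"]["hits"]:
    print(hit["_source"])
```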
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?
What types of data are you focused on supporting?
What are the challenges inherent to scaling an elasticsearch infrastructure to large volumes of log or metric data?
Is there any need for an Elasticsearch cluster in addition to Chaos Search?
For someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3?
What are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL?
Given that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS?
What mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster?
What is the system architecture that you have built to allow for querying terabytes of data in S3?
What are the biggest contributors to query latency and what have you done to mitigate them?
What are the options for access control when running queries against the data stored in S3?
What are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen?
What are your plans for the future of Chaos Search?
Contact Info
Pete Cheslock
@petecheslock on Twitter
Website
Thomas Hazel
@thomashazel on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Chaos Search
AWS S3
Cassandra
Elasticsearch
Podcast Interview
PostgreSQL
Distributed Systems
Information Theory
Lucene
Inverted Index
Kibana
Logstash
NVMe
AWS KMS
Kinesis
FluentD
Parquet
Athena
Presto
Drill
Backblaze
OpenStack Swift
Minio
EMR
DataDog
NewRelic
Elastic Beats
Metricbeat
Graphite
Snappy
Scala
Akka
Elastalert
Tensorflow
X-Pack
Data Lake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Sep 3, 2018 • 47min
An Agile Approach To Master Data Management with Mark Marinelli - Episode 46
Summary
With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
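To make the matching problem concrete, here is a toy sketch of the pairwise record comparison at the heart of master data management; production systems like Tamr replace the simple string similarity used here with learned models, and the records and threshold are invented.

```python
# Toy sketch of the record-matching step in master data management:
# score a candidate pair and flag a likely duplicate. The records and
# the 0.8 threshold are hypothetical.
from difflib import SequenceMatcher

erp_record = {"id": 342, "name": "Robert Smith", "email": "bob@example.com"}
crm_record = {"id": "tw-88", "name": "Bob Smith", "email": "bob@example.com"}

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Weight an exact email match heavily, fall back to name similarity.
score = 1.0 if erp_record["email"] == crm_record["email"] \
    else similarity(erp_record["name"], crm_record["name"])
label = "same entity" if score > 0.8 else "needs review"
print(f"match score {score:.2f}: {label}")
```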
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms
Interview
Introduction
How did you get involved in the area of data management?
Can you start by establishing a definition of data mastering that we can work from?
How does the master data set get used within the overall analytical and processing systems of an organization?
What is the traditional workflow for creating a master data set?
What has changed in the current landscape of businesses and technology platforms that makes that approach impractical?
What are the steps that an organization can take to evolve toward an agile approach to data mastering?
At what scale of company or project does it makes sense to start building a master data set?
What are the limitations of using ML/AI to merge data sets?
What are the limitations of a golden master data set in practice?
Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them?
Are there specific problem domains that are more likely to benefit from a master data set?
Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data)
What storage mechanisms are typically used for managing a master data set?
Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that goes beyond the rest of their data infrastructure?
How do you manage latency issues when trying to reference the same entities from multiple disparate systems?
What have you found to be the most common stumbling blocks for a group that is implementing a master data platform?
What suggestions do you have to help prevent such a project from being derailed?
What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering for their organization?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Tamr
Multi-Dimensional Database
Master Data Management
ETL
EDW (Enterprise Data Warehouse)
Waterfall Development Method
Agile Development Method
DataOps
Feature Engineering
Tableau
Qlik
Data Catalog
PowerBI
RDBMS (Relational Database Management System)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 27, 2018 • 25min
Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45
Summary
There are myriad reasons why data should be protected, and just as many ways to enforce that protection while it is in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information: while it is in use. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.
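Enveil's cryptography is proprietary, but the core idea can be illustrated with an additively homomorphic scheme; this sketch uses the open source python-paillier library (pip install phe), which is not what Enveil ships.

```python
# Illustration of homomorphic encryption using the python-paillier
# library (an additive scheme; Enveil's own cryptography is
# proprietary and not shown here).
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# The server can sum encrypted values without ever seeing a plaintext.
encrypted = [public_key.encrypt(x) for x in (12, 30, 8)]
encrypted_total = sum(encrypted[1:], encrypted[0])

# Only the key holder can recover the result.
print(private_key.decrypt(encrypted_total))  # 50
```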
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting Data in Use
Interview
Introduction
How did you get involved in the area of data security?
Can you start by explaining what your mission is with Enveil and how the company got started?
One of the core aspects of your platform is the principle of homomorphic encryption. Can you explain what that is and how you are using it?
What are some of the challenges associated with scaling homomorphic encryption?
What are some difficulties associated with working on encrypted data sets?
Can you describe the underlying architecture for your data platform?
How has that architecture evolved from when you first began building it?
What are some use cases that are unlocked by having a fully encrypted data platform?
For someone using the Enveil platform, what does their workflow look like?
A major reason for never decrypting data is to protect it from attackers and unauthorized access. What are some of the remaining attack vectors?
What are some aspects of the data being protected that still require additional consideration to prevent leaking information? (e.g. identifying individuals based on geographic data, or purchase patterns)
What do you have planned for the future of Enveil?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data security today?
Links
Enveil
NSA
GDPR
Intellectual Property
Zero Trust
Homomorphic Encryption
Ciphertext
Hadoop
PII (Personally Identifiable Information)
TLS (Transport Layer Security)
Spark
Elasticsearch
Side-channel attacks
Spectre and Meltdown
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Support Data Engineering Podcast

Aug 20, 2018 • 43min
Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44
Manish Jain, Creator of DGraph, discusses the benefits of storing and querying data as a graph, how DGraph overcomes the limitations of earlier graph databases, building a distributed, consistent database, and the use case of integrating 51 data silos into a single database cluster.
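As a hedged sketch of querying a graph from Python with the pydgraph client: the server address, predicate names, and data are hypothetical, and the client API may vary between versions.

```python
# Hedged sketch: running a graph query against a Dgraph cluster with
# the pydgraph client. The address and predicates are hypothetical.
import json
import pydgraph

stub = pydgraph.DgraphClientStub("localhost:9080")
client = pydgraph.DgraphClient(stub)

# Traverse from people named "Alice" to everyone they follow; graph
# traversals like this are a single query rather than a chain of joins.
query = """
{
  people(func: eq(name, "Alice")) {
    name
    follows { name }
  }
}
"""
txn = client.txn(read_only=True)
try:
    res = txn.query(query)
    print(json.loads(res.json))
finally:
    txn.discard()
stub.close()
```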


