

Data Engineering Podcast
Tobias Macey
This show goes behind the scenes of the tools, techniques, and difficulties associated with the discipline of data engineering. Databases, workflows, automation, and data manipulation are just some of the topics that you will find here.
Episodes
Mentioned books

Mar 19, 2018 • 51min
Stretching The Elastic Stack with Philipp Krenn - Episode 23
Summary
Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the process. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data.
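To ground the search use case, here is a minimal sketch of indexing and querying a document through Elasticsearch’s REST API. The host, index name, and field values are illustrative placeholders rather than anything from the episode, and a recent Elasticsearch release is assumed:

```python
import requests

BASE = "http://localhost:9200"  # placeholder: a local single-node cluster

# Index a document; Elasticsearch creates the index on first write.
requests.put(
    f"{BASE}/articles/_doc/1",
    json={"title": "Stretching the Elastic Stack", "tags": ["search", "logging"]},
)

# Force a refresh so the document is immediately searchable
# (visibility is normally near-real-time, about a second).
requests.post(f"{BASE}/articles/_refresh")

# Full-text search using the query DSL.
resp = requests.post(
    f"{BASE}/articles/_search",
    json={"query": {"match": {"title": "elastic"}}},
)
for hit in resp.json()["hits"]["hits"]:
    print(hit["_id"], hit["_source"]["title"])
```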
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine learning, Datadog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Philipp Krenn about the Elastic Stack and the ways that you can use it in your systems
Interview
Introduction
How did you get involved in the area of data management?
The Elasticsearch product has been around for a long time and is widely known, but can you give a brief overview of the other components that make up the Elastic Stack and how they work together?
Beyond the common pattern of using Elasticsearch as a search engine connected to a web application, what are some of the other use cases for the various pieces of the stack?
What are the common scaling bottlenecks that users should be aware of when they are dealing with large volumes of data?
What do you consider to be the biggest competition to the Elastic Stack as you expand the capabilities and target usage patterns?
What are the biggest challenges that you are tackling in the Elastic stack, technical or otherwise?
What are the biggest challenges facing Elastic as a company in the near to medium term?
Open source as a business model: https://www.elastic.co/blog/doubling-down-on-open
What is the vision for Elastic and the Elastic Stack going forward and what new features or functionality can we look forward to?
Contact Info
@xeraa on Twitter
xeraa on GitHub
Website
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Elastic
Vienna – Capital of Austria
What Is Developer Advocacy?
NoSQL
MongoDB
Elasticsearch
Cassandra
Neo4J
Hazelcast
Apache Lucene
Logstash
Kibana
Beats
X-Pack
ELK Stack
Metrics
APM (Application Performance Monitoring)
GeoJSON
Split Brain
Elasticsearch Ingest Nodes
PacketBeat
Elastic Cloud
Elasticon
Kibana Canvas
SwiftType
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Mar 12, 2018 • 49min
Database Refactoring Patterns with Pramod Sadalage - Episode 22
Summary
As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners, and in this episode he reflects on the current state of affairs and how things have changed over the past 12 years.
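As a concrete illustration of the version-controlled migration scripts discussed here, the sketch below shows the pattern in Python with sqlite3. The table and scripts are hypothetical; tools like Flyway and Liquibase (linked below) implement the same idea robustly:

```python
import sqlite3

# Each schema change is a numbered, immutable script kept in version
# control alongside the application code (table and columns are
# hypothetical examples).
MIGRATIONS = [
    (1, "CREATE TABLE customer (id INTEGER PRIMARY KEY, fname TEXT)"),
    # Expand/contract rename: add the new column and backfill now; drop
    # the old column in a later release once all readers have moved.
    (2, "ALTER TABLE customer ADD COLUMN first_name TEXT"),
    (3, "UPDATE customer SET first_name = fname WHERE first_name IS NULL"),
]

def migrate(conn):
    conn.execute("CREATE TABLE IF NOT EXISTS schema_version (version INTEGER PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_version")}
    for version, statement in MIGRATIONS:
        if version not in applied:  # apply each script exactly once
            conn.execute(statement)
            conn.execute("INSERT INTO schema_version VALUES (?)", (version,))
            conn.commit()

migrate(sqlite3.connect(":memory:"))
```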
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow
Interview
Introduction
How did you get involved in the area of data management?
You first co-authored Refactoring Databases in 2006. What was the state of software and database system development at the time and why did you find it necessary to write a book on this subject?
What are the characteristics of a database that make them more difficult to manage in an iterative context?
How does the practice of refactoring in the context of a database compare to that of software?
How has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution?
Is there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system?
How has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution?
What have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system?
Looking back over the past 12 years, what has changed in the areas of database design and evolution?
How has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases?
What do you see as the biggest challenges facing us over the next few years?
Contact Info
Website
pramodsadalage on GitHub
@pramodsadalage on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Database Refactoring
Website
Book
Thoughtworks
Martin Fowler
Agile Software Development
XP (Extreme Programming)
Continuous Integration
The Book
Wikipedia
Test First Development
DDL (Data Definition Language)
DML (Data Manipulation Language)
DevOps
Flyway
Liquibase
DBMaintain
Hibernate
SQLAlchemy
ORM (Object Relational Mapper)
ODM (Object Document Mapper)
NoSQL
Document Database
MongoDB
OrientDB
CouchBase
Cassandra
Neo4j
ArangoDB
Unit Testing
Integration Testing
OLAP (On-Line Analytical Processing)
OLTP (On-Line Transaction Processing)
Data Warehouse
Docker
QA (Quality Assurance)
HIPAA (Health Insurance Portability and Accountability Act)
PCI DSS (Payment Card Industry Data Security Standard)
Polyglot Persistence
TopLink Java ORM
Ruby on Rails
ActiveRecord Gem
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Mar 5, 2018 • 43min
The Future Data Economy with Roger Chen - Episode 21
Summary
Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI, which require large quantities of information to work from. As the demand for data becomes more widespread, the market for providing it will begin to transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O’Reilly AI conference and as an investor in data-driven businesses, Roger Chen is well versed in the challenges and solutions facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
A few announcements:
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Roger Chen about data liquidity and its impact on our future economies
Interview
Introduction
How did you get involved in the area of data management?
You wrote an essay discussing how the increasing usage of machine learning and artificial intelligence applications will result in a demand for data that necessitates what you refer to as ‘Data Liquidity’. Can you explain what you mean by that term?
What are some examples of the types of data that you envision as being foundational to multiple organizations and problem domains?
Can you provide some examples of the structures that could be created to facilitate data sharing across organizational boundaries?
Many companies view their data as a strategic asset and are therefore loath to provide access to other individuals or organizations. What encouragement can you provide that would convince them to externalize any of that information?
What kinds of storage and transmission infrastructure and tooling are necessary to allow for wider distribution of, and collaboration on, data assets?
What do you view as being the privacy implications from creating and sharing these larger pools of data inventory?
What do you view as some of the technical challenges associated with identifying and separating shared data from those that are specific to the business model of the organization?
With broader access to large data sets, how do you anticipate that impacting the types of businesses or products that are possible for smaller organizations?
Contact Info
@rgrchen on Twitter
LinkedIn
Angel List
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Electrical Engineering
Berkeley
Silicon Nanophotonics
Data Liquidity In The Age Of Inference
Data Silos
Example of a Data Commons Cooperative
Google Maps Moat: An article describing how Google Maps has refined raw data to create a new product
Genomics
Phenomics
ImageNet
Open Data
Data Brokerage
Smart Contracts
IPFS
Dat Protocol
Homomorphic Encryption
FileCoin
Data Programming
Snorkel
Website
Podcast Interview
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Feb 26, 2018 • 42min
Honeycomb Data Infrastructure with Sam Stokes - Episode 20
Summary
One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in our production environments and use them to answer all of your questions about what is happening in your system right now. In this episode he discusses the challenges inherent in capturing and analyzing event data, the tools that his team is using to make it possible, and how this type of knowledge can be used to improve your critical infrastructure.
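For a feel of the event model under discussion, here is a sketch of emitting one wide, structured event per unit of work. It assumes Honeycomb’s public events API; the dataset name and API key are placeholders:

```python
import requests

API_KEY = "YOUR_TEAM_KEY"   # placeholder credential
DATASET = "production-api"  # hypothetical dataset name

# One wide event per request; high-cardinality fields such as user_id
# and build_id are the point, since any field can later be grouped
# or filtered on interactively.
event = {
    "request_path": "/checkout",
    "user_id": 18472,
    "build_id": "2018.02.26-3",
    "duration_ms": 43.7,
    "status_code": 200,
}

resp = requests.post(
    f"https://api.honeycomb.io/1/events/{DATASET}",
    json=event,
    headers={"X-Honeycomb-Team": API_KEY},
)
resp.raise_for_status()
```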
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Sam Stokes about his work at Honeycomb, a modern platform for observability of software systems
Interview
Introduction
How did you get involved in the area of data management?
What is Honeycomb and how did you get started at the company?
Can you start by giving an overview of your data infrastructure and the path that an event takes from ingest to graph?
What are the characteristics of the event data that you are dealing with and what challenges does it pose in terms of processing it at scale?
In addition to the complexities of ingesting and storing data with a high degree of cardinality, being able to quickly analyze it for customer reporting poses a number of difficulties. Can you explain how you have built your systems to facilitate highly interactive usage patterns?
A high degree of visibility into a running system is desirable for developers and systems administrators, but they are not always willing or able to invest the effort to fully instrument the code or servers that they want to track. What have you found to be the most difficult aspects of data collection, and do you have any tooling to simplify the implementation for users?
How does Honeycomb compare to other systems that are available off the shelf or as a service, and when is it not the right tool?
What have been some of the most challenging aspects of building, scaling, and marketing Honeycomb?
Contact Info
@samstokes on Twitter
Blog
samstokes on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Honeycomb
Retriever
Monitoring and Observability
Kafka
Column Oriented Storage
Elasticsearch
Elastic Stack
Django
Ruby on Rails
Heroku
Kubernetes
Launch Darkly
Splunk
Datadog
Cynefin Framework
Golang
Terraform
AWS
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Feb 19, 2018 • 29min
Data Teams with Will McGinnis - Episode 19
Summary
The responsibilities of a data scientist and a data engineer often overlap and occasionally work at cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis discusses the opinions that he has gained from experience on how data teams can play to their strengths to the benefit of all.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists
Interview
Introduction
How did you get involved in the area of data management?
The terms “Data Scientist” and “Data Engineer” are fluid and seem to have a different meaning for everyone who uses them. Can you share how you define those terms?
What parallels do you see between the relationships of data engineers and data scientists and those of developers and systems administrators?
Is there a particular size of organization or problem that serves as a tipping point for when you start to separate the two roles into the responsibilities of more than one person or team?
What are the benefits of splitting the responsibilities of data engineering and data science?
What are the disadvantages?
What are some strategies to ensure successful interaction between data engineers and data scientists?
How do you view these roles evolving as they become more prevalent across companies and industries?
Contact Info
Website
wdm0006 on GitHub
@willmcginniser on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Blog Post: Tendencies of Data Engineers and Data Scientists
Predikto
Categorical Encoders
DevOps
SciKit-Learn
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Feb 11, 2018 • 1h 3min
TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18
Ajay Kulkarni and Mike Freedman, co-founders of TimescaleDB, discuss the origins and challenges of building a scalable time series database. They explain how TimescaleDB handles out-of-order data and infrequent sensor connections. They also share insights into marketing and business aspects, including the decision to release the code base as open source, future plans for the enterprise version, and the support and investment structure for the open source business model.
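As a rough sketch of the hypertable model described in the episode, the following assumes a PostgreSQL instance with the timescaledb extension available and psycopg2 installed; the connection string, table, and sample values are placeholders:

```python
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")  # placeholder DSN
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb")
cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ NOT NULL,
        device_id   TEXT        NOT NULL,
        temperature DOUBLE PRECISION
    )
""")
# Turn the plain table into a hypertable partitioned into time chunks.
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE)")

# Out-of-order arrival: each row lands in whichever chunk its
# timestamp belongs to, which is how late sensor data is absorbed.
cur.execute("INSERT INTO conditions VALUES (now(), 'dev-1', 21.5)")
cur.execute("INSERT INTO conditions VALUES (now() - interval '2 days', 'dev-1', 19.8)")
conn.commit()
```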

Feb 4, 2018 • 54min
Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17
Summary
One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream-oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both models, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.
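A minimal sketch of Pulsar’s produce/consume flow using the pulsar-client Python library; the broker URL, topic, and subscription name are placeholders:

```python
import pulsar

client = pulsar.Client("pulsar://localhost:6650")  # placeholder broker URL

producer = client.create_producer("persistent://public/default/events")
producer.send(b"hello pulsar")

# Each named subscription keeps its own cursor over the topic, which
# is how Pulsar serves both queueing and pub-sub patterns.
consumer = client.subscribe("persistent://public/default/events", "my-subscription")
msg = consumer.receive()
print(msg.data())
consumer.acknowledge(msg)  # acknowledged messages can still be retained

client.close()
```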
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Pulsar is and what the original inspiration for the project was?
What have been some of the most challenging aspects of building and promoting Pulsar?
For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components?
What are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to?
What projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison?
The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka?
One of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged?
When is Pulsar the wrong tool to use?
What are some of the improvements or new features that you have planned for the future of Pulsar?
Contact Info
Matteo
merlimat on GitHub
@merlimat on Twitter
Rajan
@dhabaliaraj on Twitter
rhabalia on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pulsar
Publish-Subscribe
Yahoo
Streamlio
ActiveMQ
Kafka
Bookkeeper
SLA (Service Level Agreement)
Write-Ahead Log
Ansible
Zookeeper
Pulsar Deployment Instructions
RabbitMQ
Confluent Schema Registry
Podcast Interview
Kafka Connect
Wallaroo
Podcast Interview
Kinesis
Athenz
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jan 29, 2018 • 1h 3min
Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16
Summary
Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.
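To make the versioning idea concrete, here is a toy hash-chained log in Python. This is a conceptual illustration only, not Dat’s actual implementation, which uses the Merkle tree structure described in the whitepaper linked below:

```python
import hashlib

def chain_append(log, content):
    """Append content, linking it to the previous entry's digest."""
    prev = log[-1]["digest"] if log else ""
    digest = hashlib.sha256((prev + content).encode()).hexdigest()
    log.append({"content": content, "digest": digest})

log = []
chain_append(log, "results-v1.csv")
chain_append(log, "results-v2.csv")

# Tampering with any earlier entry changes every later digest, so a
# peer replicating the log can verify the entire version history.
print([entry["digest"][:12] for entry in log])
```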
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about Dat Project, a distributed data sharing protocol for building applications of the future
Interview
Introduction
How did you get involved in the area of data management?
What is the Dat project and how did it get started?
How have the grants to the Dat project influenced the focus and pace of development that was possible?
Now that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project?
Can you explain how the Dat protocol is designed and how it has evolved since it was first started?
How does Dat manage conflict resolution and data versioning when replicating between multiple machines?
One of the primary use cases that is mentioned in the documentation and website for Dat is that of hosting and distributing open data sets, with a focus on researchers. How does Dat help with that effort and what improvements does it offer over other existing solutions?
One of the difficult aspects of building a peer-to-peer protocol is that of establishing a critical mass of users to add value to the network. How have you approached that effort and how much progress do you feel that you have made?
How does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via Dat, versus the common three-tier architecture oriented around persistent databases?
What mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default?
For someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network?
What have been the most challenging aspects of building and promoting Dat?
What are some of the most interesting or inspiring uses of the Dat protocol that you are aware of?
Contact Info
Dat
datproject.org
Email
@dat_project on Twitter
Dat Chat
Danielle
Email
@daniellecrobins
Joe
Email
@joeahand on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Dat Project
Code For Science and Society
Neuroscience
Cell Biology
OpenCon
Mozilla Science
Open Education
Open Access
Open Data
Fortune 500
Data Warehouse
Knight Foundation
Alfred P. Sloan Foundation
Gordon and Betty Moore Foundation
Dat In The Lab
Dat in the Lab blog posts
California Digital Library
IPFS
Dat on Open Collective – COMING SOON!
ScienceFair
Stencila
eLIFE
Git
BitTorrent
Dat Whitepaper
Merkle Tree
Certificate Transparency
Dat Protocol Working Group
Dat Multiwriter Development – Hyperdb
Beaker Browser
WebRTC
IndexedDB
Rust
C
Keybase
PGP
Wire
Zenodo
Dryad Data Sharing
Dataverse
RSync
FTP
Globus
Fritter
Fritter Demo
Rotonde how to
Joe’s website on Dat
Dat Tutorial
Data Rescue – NYTimes Coverage
Data.gov
Libraries+ Network
UC Conservation Genomics Consortium
Fair Data principles
hypervision
hypervision in browser
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jan 22, 2018 • 37min
Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15
Summary
The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a company are housed in so-called “Dark Data” sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value by leveraging labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can be used to democratize machine learning by making it feasible for organizations with smaller data sets than those required by most tooling.
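To make the labeling-function idea concrete, here is a toy sketch in plain Python rather than Snorkel’s actual API. Domain experts write noisy heuristics, and their votes are combined into training labels; Snorkel learns the combination with a generative model, for which a simple majority vote stands in here:

```python
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_claims_cure(text):          # hypothetical expert heuristic #1
    return POSITIVE if "cures" in text else ABSTAIN

def lf_mentions_placebo(text):     # hypothetical expert heuristic #2
    return NEGATIVE if "placebo" in text else ABSTAIN

LFS = (lf_claims_cure, lf_mentions_placebo)

def label(text):
    """Combine the labeling functions' non-abstain votes into one noisy label."""
    votes = [vote for vote in (lf(text) for lf in LFS) if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    return max(set(votes), key=votes.count)  # stand-in for the generative model

print(label("new compound cures headaches in trials"))  # -> 1 (POSITIVE)
```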
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Alex Ratner about Snorkel and Dark Data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of dark data and how Snorkel helps to extract value from it?
What are some of the most challenging aspects of building labeling functions and what tools or techniques are available to verify their validity and effectiveness in producing accurate outcomes?
Can you provide some examples of how Snorkel can be used to build useful models in production contexts for companies or problem domains where data collection is difficult to do at large scale?
For someone who wants to use Snorkel, what are the steps involved in processing the source data and what tooling or systems are necessary to analyze the outputs for generating usable insights?
How is Snorkel architected and how has the design evolved over its lifetime?
What are some situations where Snorkel would be poorly suited for use?
What are some of the most interesting applications of Snorkel that you are aware of?
What are some of the other projects that you and your group are working on that interact with Snorkel?
What are some of the features or improvements that you have planned for future releases of Snorkel?
Contact Info
Website
ajratner on Github
@ajratner on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Stanford
DAWN
HazyResearch
Snorkel
Christopher Ré
Dark Data
DARPA
Memex
Training Data
FDA
ImageNet
National Library of Medicine
Empirical Studies of Conflict
Data Augmentation
PyTorch
Tensorflow
Generative Model
Discriminative Model
Weak Supervision
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA

Jan 15, 2018 • 46min
CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14
Summary
As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data driven world.
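To ground the discussion, here is one of the simplest CRDTs, a grow-only counter, sketched in Python; production systems such as Riak ship hardened implementations of this and richer types:

```python
# A grow-only counter (G-Counter): each replica increments only its
# own slot, and merging takes the element-wise maximum, so all
# replicas converge regardless of the order in which updates and
# merges are applied.
class GCounter:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counts = {}  # node_id -> highest count observed from that node

    def increment(self, amount=1):
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + amount

    def merge(self, other):
        # Merge is commutative, associative, and idempotent, which is
        # exactly what makes repeated, out-of-order replication safe.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

    def value(self):
        return sum(self.counts.values())

a, b = GCounter("a"), GCounter("b")
a.increment()
b.increment(); b.increment()
a.merge(b); b.merge(a)
assert a.value() == b.value() == 3
```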
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Christopher Meiklejohn about establishing consensus in distributed systems
Interview
Introduction
How did you get involved in the area of data management?
You have dealt with CRDTs with your work in industry, as well as in your research. Can you start by explaining what a CRDT is, how you first began working with them, and some of their current manifestations?
Other than CRDTs, what are some of the methods for establishing consensus across nodes in a system and how does increased scale affect their relative effectiveness?
One of the projects that you have been involved in which relies on CRDTs is LASP. Can you describe what LASP is and what your role in the project has been?
Can you provide examples of some production systems or available tools that are leveraging CRDTs?
If someone wants to take advantage of CRDTs in their applications or data processing, what are the available off-the-shelf options, and what would be involved in implementing custom data types?
What areas of research are you most excited about right now?
Given that you are currently working on your PhD, do you have any thoughts on the projects or industries that you would like to be involved in once your degree is completed?
Contact Info
Website
cmeiklejohn on GitHub
Google Scholar Citations
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Basho
Riak
Syncfree
LASP
CRDT
Mesosphere
CAP Theorem
Cassandra
DynamoDB
Bayou System (Xerox PARC)
Multivalue Register
Paxos
RAFT
Byzantine Fault Tolerance
Two Phase Commit
Spanner
ReactiveX
Tensorflow
Erlang
Docker
Kubernetes
Erleans
Orleans
Atom Editor
Automerge
Martin Kleppmann
Akka
Delta CRDTs
Antidote DB
Kops
Eventual Consistency
Causal Consistency
ACID Transactions
Joe Hellerstein
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA


