Linear Digressions

Ben Jaffe and Katie Malone
Mar 26, 2018 • 13min

Google Flu Trends

It's been a nasty flu season this year, so we were remembering a story from a few years back (one we haven't covered yet on this podcast) about when Google tried to predict flu outbreaks faster than the Centers for Disease Control by monitoring search traffic and looking for spikes in searches for flu symptoms, doctor's appointments, and other related terms. It's a cool idea, but it turned into a cautionary tale about what can go wrong when Google's algorithm systematically overestimated flu incidence for almost 2 years straight. Relevant link: https://gking.harvard.edu/publications/parable-google-flu%C2%A0traps-big-data-analysis
Mar 19, 2018 • 31min

How to pick projects for a professional data science team

This week's episode is for data scientists, sure, but also for data science managers and executives at companies with data science teams. These folks all think very differently about the same questions: what should a data science team be working on, and how should that decision be made? That's the subject of a talk that I (Katie) gave at Strata Data in early March, about how my co-department head and I select projects for our team to work on. We have several goals in data science project selection at Civis Analytics (where I work), which can be summarized as "balance the best attributes of bottom-up and top-down decision-making." We achieve this balance, or at least get pretty close, using a process we've come to call the Idea Factory (after a great book about Bell Labs). This talk is about that process, how it works in the real world of a data science company, and how we see it working in the data science programs of other companies. Relevant link: https://conferences.oreilly.com/strata/strata-ca/public/schedule/detail/63905
Mar 12, 2018 • 13min

Autoencoders

Autoencoders are neural nets that are optimized for creating outputs that... look like the inputs to the network. Turns out this is a not-too-shabby way to do unsupervised machine learning with neural nets.
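To make the "outputs that look like the inputs" idea concrete, here is a minimal sketch of about the simplest autoencoder there is: one linear layer that squeezes the input down to a small code and one that expands it back, trained with plain gradient descent on reconstruction error. Everything here (the function names, the toy data, the learning rate) is our own illustrative assumption and not something from the episode; real autoencoders typically use nonlinear layers and a deep learning library.

```python
import numpy as np


def make_toy_data(n_samples=200, n_features=5, latent_dim=2, seed=1):
    """Hypothetical toy data: points that really live on a low-dimensional subspace."""
    rng = np.random.default_rng(seed)
    latent = rng.normal(size=(n_samples, latent_dim))
    mixing = rng.normal(size=(latent_dim, n_features))
    X = latent @ mixing
    # Standardize each feature so the gradient steps are well behaved.
    return (X - X.mean(axis=0)) / X.std(axis=0)


def train_linear_autoencoder(X, hidden_dim, lr=0.05, epochs=1000, seed=0):
    """Train encoder/decoder weights to reconstruct X from a narrow code."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    W_enc = rng.normal(scale=0.1, size=(n_features, hidden_dim))
    W_dec = rng.normal(scale=0.1, size=(hidden_dim, n_features))
    losses = []
    for _ in range(epochs):
        code = X @ W_enc            # encoder: compress the input
        X_hat = code @ W_dec        # decoder: try to rebuild the input
        err = X_hat - X
        losses.append(float(np.mean(err ** 2)))
        # Gradients of the mean squared reconstruction error.
        scale = 2.0 / err.size
        grad_dec = scale * (code.T @ err)
        grad_enc = scale * (X.T @ (err @ W_dec.T))
        W_enc -= lr * grad_enc
        W_dec -= lr * grad_dec
    return W_enc, W_dec, losses


if __name__ == "__main__":
    X = make_toy_data()
    _, _, losses = train_linear_autoencoder(X, hidden_dim=2)
    print(f"reconstruction error: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Notice there are no labels anywhere: the input doubles as the training target, which is what makes this unsupervised. The narrow code in the middle is the learned representation.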
Mar 5, 2018 • 26min

When Private Data Isn't Private Anymore

After all the back-patting around making data science datasets and code more openly available, we figured it was time to also dump a bucket of cold water on everyone's heads and talk about the things that can go wrong when data and code are a little too open. In this episode, we'll talk about two interesting recent examples: a de-identified medical dataset in Australia that was re-identified so specific celebrities and athletes could be matched to their medical records, and a series of military bases that were spotted in a public fitness tracker dataset.
Feb 26, 2018 • 35min

What makes a machine learning algorithm "superhuman"?

A few weeks ago, we podcasted about a neural network that was being touted as "better than doctors" in diagnosing pneumonia from chest x-rays, and how the underlying dataset used to train the algorithm raised some serious questions. We're back again this week because the author of the original blog post pointed us toward further developments. All in all, there's a lot more clarity now around how the authors arrived at their original "better than doctors" claim, and a number of adjustments and improvements as the original result was de/re-constructed. Anyway, there are a few things that are cool about this. First, it's a worthwhile follow-up to a popular recent episode. Second, it goes *inside* an analysis to see what things like imbalanced classes, outliers, and (possible) signal leakage can do to real science. And last, it raises a really interesting question in an age when computers are often claimed to be better than humans: what do those claims really mean? Relevant links: https://lukeoakdenrayner.wordpress.com/2018/01/24/chexnet-an-in-depth-review/
Feb 19, 2018 • 17min

Open Data and Open Science

One interesting trend we've noted recently is the proliferation of papers, articles and blog posts about data science that don't just tell the result--they include data and code that allow anyone to repeat the analysis. It's far from universal (for a timely counterpoint, read this article), but we seem to be moving toward a new normal where data science conclusions are expected to be shown, not just told. Relevant links: https://github.com/fivethirtyeight/data https://blog.patricktriest.com/police-data-python/
Feb 12, 2018 • 20min

Defining the quality of a machine learning production system

Building a machine learning system and maintaining it in production are two very different things. Some folks over at Google wrote a paper that shares their thoughts around all the items you might want to test or check for your production ML system. Relevant links: https://research.google.com/pubs/pub45742.html
Feb 4, 2018 • 19min

Auto-generating websites with deep learning

We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurrent neural nets that generate descriptions and captions of the images. Our episode today tells a similar tale, except today we're talking about a blog post where the author fed in wireframes of a website design and asked the neural net to generate the HTML and CSS that would actually build a website that looks like the wireframes. If you're a programmer who thinks your job is challenging enough that you're automation-proof, guess again... Link to blog post: https://blog.floydhub.com/turning-design-mockups-into-code-with-deep-learning/
Jan 29, 2018 • 21min

The Case for Learned Index Structures, Part 2: Hash Maps and Bloom Filters

Last week we started the story of how you could use a machine learning model in place of a data structure, and this week we wrap up with an exploration of Bloom Filters and Hash Maps. Just like last week, when we covered B-trees, we'll walk through both the "classic" implementation of these data structures and how a machine learning model could create the same functionality.
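For reference, the "classic" Bloom filter boils down to a bit array plus a handful of hash functions, as in the rough sketch below. The class name, the sizes, and the trick of deriving several hashes by salting SHA-256 are our own illustrative choices, not details from the paper or the episode, and this shows only the classic structure, not the learned replacement.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        # One byte per "bit" keeps the example readable; a real filter would pack bits.
        self.bits = bytearray(num_bits)

    def _positions(self, item):
        # Derive several hash positions by salting one hash function (an assumption
        # made for simplicity; any family of independent hashes would do).
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # If any position is still 0, the item was definitely never added.
        return all(self.bits[pos] for pos in self._positions(item))


if __name__ == "__main__":
    seen = BloomFilter()
    for word in ["b-tree", "hash map", "bloom filter"]:
        seen.add(word)
    print("bloom filter" in seen)   # an added item is always reported as present
    print("linked list" in seen)    # usually absent, but a false positive is possible
```

The learned version keeps the same contract (no false negatives allowed) but uses a model to guess membership instead of relying only on hash positions.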
Jan 22, 2018 • 19min

The Case for Learned Index Structures, Part 1: B-Trees

Jeff Dean and his collaborators at Google are turning the machine learning world upside down (again) with a recent paper about how machine learning models can be used as surprisingly effective substitutes for classic data structures. In this first part of a two-part series, we'll go through a data structure called b-trees. The structural form of b-trees makes them efficient for searching, but if you squint at a b-tree and look at it a little bit sideways then the search functionality starts to look a little bit like a regression model--hence the relevance of machine learning models. If this sounds kinda weird, or we lost you at b-tree, don't worry--lots more details in the episode itself.
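Here's a toy sketch of the "search looks like regression" idea: given sorted keys, fit a line that predicts each key's position, remember the worst prediction error, and then only search a small window around the guess. This is our own simplified illustration (a single linear fit with hypothetical names, assuming unique numeric keys and at least two of them), not the model architecture from the paper.

```python
import bisect

import numpy as np


class LearnedIndex:
    """Toy learned index: a linear model predicts where a key sits in a sorted list."""

    def __init__(self, sorted_keys):
        self.keys = list(sorted_keys)
        positions = np.arange(len(self.keys))
        # "Training": regress position on key value.
        self.slope, self.intercept = np.polyfit(self.keys, positions, deg=1)
        predicted = self.slope * np.asarray(self.keys, dtype=float) + self.intercept
        # The worst-case error tells us how wide the correction window must be.
        self.max_error = int(np.ceil(np.max(np.abs(predicted - positions))))

    def lookup(self, key):
        """Return the position of key, or None if it is not stored."""
        n = len(self.keys)
        guess = int(round(self.slope * key + self.intercept))
        lo = min(max(0, guess - self.max_error - 1), n)
        hi = max(lo, min(n, guess + self.max_error + 2))
        # Binary search only inside the window around the model's guess.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < n and self.keys[i] == key:
            return i
        return None


if __name__ == "__main__":
    keys = [i * i for i in range(100)]   # deliberately not linear in position
    index = LearnedIndex(keys)
    print(index.lookup(49), index.max_error)
```

The better the model fits the key distribution, the smaller `max_error` gets and the less searching is left to do, which is roughly the trade a b-tree makes with its levels of nodes.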
