Linear Digressions

Ben Jaffe and Katie Malone
May 24, 2020 • 27min

Stein's Paradox

This is a re-release of an episode that was originally released on February 26, 2017. When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should you estimate it: use measurements of the individual, or get some extra information from the group? The James-Stein estimator tells you how to combine individual and group information to make predictions that, taken over the whole group, are more accurate than if you treated each individual, well, individually.
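To make the shrinkage idea concrete, here is a minimal sketch of one common form of the estimator: the positive-part James-Stein estimator that pulls each individual value toward the group mean. It assumes every observation has the same, known noise variance; the batting averages, the 45 at-bats, and the binomial variance approximation are made-up illustration values, not data from the episode.

```python
from statistics import fmean


def james_stein(observations, noise_variance):
    """Shrink each observation toward the group mean (positive-part form).

    Assumes every observation carries the same, known noise variance.
    """
    k = len(observations)
    if k < 4:
        raise ValueError("shrinking toward the group mean needs at least 4 observations")
    group_mean = fmean(observations)
    spread = sum((x - group_mean) ** 2 for x in observations)
    if spread == 0.0:
        # All observations are identical, so there is nothing to shrink.
        return list(observations)
    # Shrinkage factor in [0, 1]: 1 keeps the individual values, 0 returns the group mean.
    shrink = max(0.0, 1.0 - (k - 3) * noise_variance / spread)
    return [group_mean + shrink * (x - group_mean) for x in observations]


# Hypothetical early-season batting averages, each assumed to come from 45 at-bats.
averages = [0.400, 0.378, 0.356, 0.333, 0.311, 0.289, 0.267, 0.244, 0.222, 0.200]
at_bats = 45
group_mean = fmean(averages)
# Binomial approximation to the sampling variance of one player's average.
noise_variance = group_mean * (1 - group_mean) / at_bats

shrunk = james_stein(averages, noise_variance)
for raw, adjusted in zip(averages, shrunk):
    print(f"{raw:.3f} -> {adjusted:.3f}")
```

The noisier the individual measurements are relative to the spread across the group, the closer the shrinkage factor gets to zero and the more each estimate leans on the group.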
May 18, 2020 • 21min

Protecting Individual-Level Census Data with Differential Privacy

The power of finely-grained, individual-level data comes with a drawback: it compromises the privacy of potentially anyone and everyone in the dataset. Even for de-identified datasets, there can be ways to re-identify the records or otherwise figure out sensitive personal information. That problem has motivated the study of differential privacy, a set of techniques and definitions for keeping personal information private when datasets are released or used for study. Differential privacy is getting a big boost this year, as it’s being implemented across the 2020 US Census as a way of protecting the privacy of census respondents while still opening up the dataset for research and policy use. When two important topics come together like this, we can’t help but sit up and pay attention.
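As a rough illustration of how a differentially private release can work, the sketch below uses the Laplace mechanism, a standard textbook technique: add noise with scale sensitivity/epsilon to a count before publishing it. This is not a description of the Census Bureau's actual system, and the function names, the count of 1234, and the epsilon value are all hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count with Laplace noise calibrated to epsilon.

    `sensitivity` is how much one person can change the count
    (1 for a simple head count). Smaller epsilon means more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    return true_count + laplace_noise(sensitivity / epsilon, rng)


# Hypothetical example: the number of residents in one census block.
rng = random.Random(2020)
noisy = private_count(1234, epsilon=0.5, rng=rng)
print(f"published count: {noisy:.1f}")
```

The trade-off is visible in the scale parameter: a smaller epsilon gives stronger privacy but a noisier, less useful published number.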
May 11, 2020 • 15min

Causal Trees

What do you get when you combine the causal inference needs of econometrics with the data-driven methodology of machine learning? Usually these two don’t go well together (deriving causal conclusions from naive data methods leads to biased answers) but economists Susan Athey and Guido Imbens are on the case. This episode explores their algorithm for recursively partitioning a dataset to find heterogeneous treatment effects, or for you ML nerds, applying decision trees to causal inference problems. It’s not a free lunch, but for those (like us!) who love crossover topics, causal trees are a smart approach from one field hopping the fence to another. Relevant links: https://www.pnas.org/content/113/27/7353
May 4, 2020 • 36min

The Grammar Of Graphics

You may not realize it consciously, but beautiful visualizations have rules. The rules are often implicit and manifest themselves as expectations about how the data is summarized, presented, and annotated so you can quickly extract the information in the underlying data using just visual cues. It’s a bit abstract but very profound, and these principles underlie the ggplot2 package in R that makes famously beautiful plots with minimal code. This episode covers a paper by Hadley Wickham (author of ggplot2, among other R packages) that unpacks the layered approach to graphics taken in ggplot2, and makes clear the assumptions and structure of many familiar data visualizations.
Apr 27, 2020 • 21min

Gaussian Processes

It’s pretty common to fit a function to a dataset when you’re a data scientist. But in many cases, it’s not clear what kind of function might be most appropriate: linear? quadratic? sinusoidal? some combination of these, and perhaps others? Gaussian processes introduce a nonparametric option where you can fit over all the possible types of functions, using the data points in your dataset as constraints on the results that you get (the idea being that, no matter what the “true” underlying function is, it produced the data points you’re trying to fit). What this means is a very flexible, but depending on your parameters not-too-flexible, way to fit complex datasets. The math underlying GPs gets complex, and the links below contain some excellent visualizations that help make the underlying concepts clearer. Check them out! Relevant links: http://katbailey.github.io/post/gaussian-processes-for-dummies/ https://thegradient.pub/gaussian-process-not-quite-for-dummies/ https://distill.pub/2019/visual-exploration-gaussian-processes/
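For readers who want to see the idea in code, here is a minimal sketch of Gaussian process regression with NumPy. It assumes a zero-mean prior and a squared-exponential (RBF) kernel, which is only one of many possible kernel choices; the function names, the three training points, and the parameter values are illustrative assumptions, not taken from the episode or the linked posts.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=1.0, variance=1.0):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    sq_dist = (a[:, None] - b[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dist / length_scale**2)


def gp_posterior(x_train, y_train, x_test, length_scale=1.0, noise=1e-6):
    """Posterior mean and standard deviation of a zero-mean GP at x_test."""
    K = rbf_kernel(x_train, x_train, length_scale) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test, length_scale)
    K_ss = rbf_kernel(x_test, x_test, length_scale)

    # A Cholesky factorization is a stable way to solve against K.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha

    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
    return mean, std


# Three hypothetical observations of an unknown function (here, a sine curve).
x_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(x_train)
x_test = np.linspace(-3.0, 3.0, 7)

mean, std = gp_posterior(x_train, y_train, x_test)
for x, m, s in zip(x_test, mean, std):
    print(f"x={x:+.1f}  mean={m:+.3f}  std={s:.3f}")
```

The `length_scale` parameter is one example of the knob that keeps the fit from being too flexible: larger values force smoother functions, smaller values let the fit wiggle more between data points. The predicted uncertainty shrinks near the observations and grows away from them.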
Apr 20, 2020 • 19min

Keeping ourselves honest when we work with observational healthcare data

This episode explores the challenges of working with observational healthcare data, including the many analysis decisions data scientists have to make along the way. It discusses strategies and techniques for making those choices without bias, so that causal conclusions stay accurate, and highlights a benchmark study that compares different analysis approaches in healthcare research.
Apr 13, 2020 • 29min

Changing our formulation of AI to avoid runaway risks: Interview with Prof. Stuart Russell

AI is evolving incredibly quickly, and thinking now about where it might go next (and how we as a species and a society should be prepared) is critical. Professor Stuart Russell, an AI expert at UC Berkeley, has a formulation for modifications to AI that we should study and try implementing now to keep it much safer in the long run. Prof. Russell’s new book, “Human Compatible: Artificial Intelligence and the Problem of Control,” gives an accessible but deeply thoughtful exploration of why he thinks runaway AI is something we need to be considering seriously now, and what changes in formulation might be a solution. This episode features Prof. Russell as a special guest, exploring the topics in his book and giving more perspective on the long-term possible futures of AI: both good and bad. Relevant links: https://www.penguinrandomhouse.com/books/566677/human-compatible-by-stuart-russell/
Apr 6, 2020 • 24min

Putting machine learning into a database

Most data scientists bounce back and forth regularly between doing analysis in databases using SQL and building and deploying machine learning pipelines in R or Python. But if we think ahead a few years, a few visionary researchers are starting to see a world in which the ML pipelines can actually be deployed inside the database. Why? One strong advantage for databases is they have built-in features for data governance, including things like permissioning access and tracking the provenance of data. Adding machine learning as another thing you can do in a database means that, potentially, these enterprise-grade features will be available for ML models too, which will make them much more widely accepted across enterprises with tight IT policies. The papers this week articulate the gap between enterprise needs and current ML infrastructure, how ML in a database could be a way to knit the two closer together, and a proof-of-concept that ML in a database can actually work. Relevant links: https://blog.acolyer.org/2020/02/19/ten-year-egml-predictions/ https://blog.acolyer.org/2020/02/21/extending-relational-query-processing/
Mar 29, 2020 • 29min

The work-from-home episode

Many of us have the privilege of working from home right now, in an effort to keep ourselves and our families safe and slow the transmission of covid-19. But working from home is an adjustment for many of us, and can hold some challenges compared to coming into the office every day. This episode explores this a little bit, informally, as we compare our new work-from-home setups and reflect on what’s working well and what we’re finding challenging.
Mar 23, 2020 • 25min

Understanding Covid-19 transmission: what the data suggests about how the disease spreads

Covid-19 is turning the world upside down right now. One thing that’s extremely important to understand, in order to fight it as effectively as possible, is how the virus spreads and especially how much of the spread of the disease comes from carriers who are experiencing no or mild symptoms but are contagious anyway. This episode digs into the epidemiological model that was published in Science this week. The model finds that the majority of carriers of the coronavirus, 80-90%, have not been detected as cases. This has big implications for the importance of social distancing as a way to get the pandemic under control, and it explains why a more comprehensive testing program is critical for the United States. Also, in lighter news, Katie (a native of Dayton, Ohio) lays a data-driven claim for just declaring the University of Dayton Flyers to be the 2020 NCAA College Basketball champions. Relevant links: https://science.sciencemag.org/content/early/2020/03/13/science.abb3221
