Linear Digressions
Ben Jaffe and Katie Malone
Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.
Episodes
Jun 19, 2017 • 16min
Anscombe's Quartet
Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different. It's easy to think that having a good set of summary statistics (like mean, variance and correlation) can tell you everything important about a dataset, or at least enough to know if two datasets are extremely similar or extremely different, but Anscombe's Quartet will always be standing behind you, laughing at how silly that idea is.
Anscombe's Quartet was devised in 1973 as an example of how summary statistics can be misleading, but today we can even do one better: the Datasaurus Dozen is a set of twelve datasets, all extremely visually distinct, that have the same summary stats as a source dataset that, there's no other way to put this, looks like a dinosaur. It's an example of how datasets can be generated to look like almost anything while still preserving a chosen set of summary statistics. In other words, Anscombe's Quartets can be generated at will, and we should all be reminded to visualize our data (not just compute summary statistics) if we want to claim to really understand it.
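As a rough illustration of the point, the sketch below computes the mean, sample variance, and Pearson correlation for the four Anscombe datasets using only the Python standard library. The data values are the ones commonly reproduced from Anscombe's 1973 paper; the names `pearson`, `quartet`, and `summary` are made up for this example.

```python
from statistics import mean, variance


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# Datasets I-III share the same x values; dataset IV has its own.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# (mean of y, sample variance of y, correlation of x and y) per dataset
summary = {
    name: (mean(ys), variance(ys), pearson(xs, ys))
    for name, (xs, ys) in quartet.items()
}

for name, (m, v, r) in summary.items():
    print(f"{name}: mean(y)={m:.2f}  var(y)={v:.2f}  corr(x, y)={r:.2f}")
```

The printed rows come out nearly identical, yet a scatter plot of each dataset shows four very different shapes, which is exactly why plotting is worth the extra step.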
Jun 12, 2017 • 19min
Traffic Metering Algorithms
Originally released June 2016
This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and clog up. If you're someone who listens to podcasts while commuting, and especially if your area has on-ramp metering, you'll never look at highway access control the same way again (yeah, we know this is super nerdy; it's also super awesome).
Jun 5, 2017 • 20min
Page Rank
The year: 1998. The size of the web: 150 million pages. The problem: information retrieval. How do you find the "best" web pages to return in response to a query? A graduate student named Larry Page had an idea for how it could be done better and created a search engine as a research project. That search engine was called Google.
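For a feel of how a link-based ranking can work, here is a minimal sketch of the textbook power-iteration form of PageRank. It is not taken from the episode: the toy graph `toy_web`, the damping factor of 0.85, and the iteration count are all illustrative assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook power-iteration PageRank over a dict of page -> outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}  # start from a uniform distribution

    for _ in range(iterations):
        # Every page gets a small baseline share (the "random jump" term).
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page splits its damped rank evenly among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web: "c" is linked to by every other page.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)

for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy graph the page with the most incoming links ends up ranked highest, and the page nobody links to ends up lowest.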
May 29, 2017 • 20min
Fractional Dimensions
We chat about fractional dimensions, and what the actual heck those are.
May 22, 2017 • 22min
Things You Learn When Building Models for Big Data
As more and more data gets collected seemingly every day, and data scientists use that data for modeling, the technical limits associated with machine learning on big datasets keep getting pushed back. This week is a first-hand case study in using scikit-learn (a popular Python machine learning library) on multi-terabyte datasets, which is something that Katie does a lot for her day job at Civis Analytics. There are a lot of considerations for doing something like this--cloud computing, artful use of parallelization, considerations of model complexity, and the computational demands of training vs. prediction, to name just a few.
May 15, 2017 • 18min
How to Find New Things to Learn
If you're anything like us, you a) are always curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it very digestible). We hope this podcast is a part of the solution for you, but if you're looking to go further (who isn't?) then we have a few new resources that are presenting high-quality content in a fresh, accessible way. Boring old PDFs full of inscrutable math notation, your days are numbered!
May 8, 2017 • 14min
Federated Learning
As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that's being supplied as users interact with the algorithm? In other words, how do we do machine learning when the training dataset is distributed across many devices, imbalanced, and the usage associated with any one user needs to be obscured somewhat to protect the privacy of that user? Enter Federated Learning, a set of related algorithms from Google that are designed to help out in exactly this scenario. If you've used keyboard shortcuts or autocomplete on an Android phone, chances are you've encountered Federated Learning even if you didn't know it.
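One way to picture the idea is federated averaging: each device improves the shared model on its own data, and only the updated weights (never the raw data) are sent back and averaged. The sketch below is a toy version under heavy assumptions: a one-parameter model `y = weight * x`, three made-up devices, and no privacy mechanism at all. The function names and data are hypothetical.

```python
def local_update(weight, data, lr=0.01, epochs=5):
    """Run a few gradient-descent steps on one device's private (x, y) pairs."""
    for _ in range(epochs):
        # Gradient of mean squared error for the model y = weight * x.
        grad = sum(2 * (weight * x - y) * x for x, y in data) / len(data)
        weight -= lr * grad
    return weight


def federated_round(global_weight, devices):
    """One round: every device trains locally, then the server averages the results."""
    total_examples = sum(len(data) for data in devices)
    updates = [(local_update(global_weight, data), len(data)) for data in devices]
    # Weight each device's result by how many examples it trained on.
    return sum(w * n for w, n in updates) / total_examples


# Hypothetical devices with imbalanced amounts of data, all following y = 3 * x.
devices = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(3.0, 9.0)],
    [(0.5, 1.5), (1.5, 4.5), (4.0, 12.0)],
]

weight = 0.0
for _ in range(50):
    weight = federated_round(weight, devices)

print(f"learned weight: {weight:.4f}")
```

The server in this sketch only ever sees per-device weights and example counts; obscuring any one user's contribution, as the episode describes, would need additional machinery on top.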
May 1, 2017 • 18min
Word2Vec
Word2Vec is probably the go-to algorithm for vectorizing text data these days. Which makes sense, because it is wicked cool. Word2Vec has it all: neural networks, skip-grams and bag-of-words implementations, a multiclass classifier that gets swapped out for a binary classifier, made-up dummy words, and a model that isn't actually used to predict anything (usually). And all that's before we get to the part about how Word2Vec allows you to do algebra with text. Seriously, this stuff is cool.
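To make the skip-gram idea a bit more concrete, here is a small sketch of how (center word, context word) training pairs can be pulled out of a sentence. It only covers the pair-generation step, not the neural network itself; the function name `skipgram_pairs`, the window size, and the example sentence are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


sentence = "the quick brown fox jumps".split()
pairs = skipgram_pairs(sentence, window=1)

for center, context in pairs:
    print(center, "->", context)
```

A skip-gram model is then trained to predict the context word from the center word, and the learned internal weights, rather than the predictions, are what get used as word vectors.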
Apr 24, 2017 • 17min
Feature Processing for Text Analytics
It seems like every day there are more and more machine learning problems that involve learning on text data, but text itself makes for fairly lousy inputs to machine learning algorithms. That's why there are text vectorization algorithms, which re-format text data so it's ready to use for machine learning. In this episode, we'll go over some of the most common and useful ways to preprocess text data for machine learning.
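One of the simplest vectorization schemes is bag-of-words, where each document becomes a vector of word counts over a shared vocabulary. The sketch below is a bare-bones version with lowercase-and-split-on-whitespace tokenization, which is an assumption made for brevity; the function names and example documents are hypothetical.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct lowercase word to a column index, in sorted order."""
    words = sorted({word for doc in documents for word in doc.lower().split()})
    return {word: index for index, word in enumerate(words)}


def bag_of_words(document, vocabulary):
    """Count how often each vocabulary word appears in the document."""
    counts = Counter(document.lower().split())
    return [counts.get(word, 0) for word in vocabulary]


documents = ["the cat sat", "the dog sat on the cat"]
vocabulary = build_vocabulary(documents)
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

print(list(vocabulary))
for doc, vector in zip(documents, vectors):
    print(vector, "<-", doc)
```

Each document is now a fixed-length list of numbers, which is the kind of input most machine learning algorithms expect. Word order is thrown away in the process, which is one reason richer schemes exist.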
Apr 17, 2017 • 21min
Education Analytics
This week we'll hop into the rapidly developing industry around predictive analytics for education. For many of the students who eventually drop out, data science is showing that there might be early warning signs that the student is in trouble--we'll talk about what some of those signs are, and then dig into the meatier questions around discrimination, who owns a student's data, and correlation vs. causation. Spoiler: we have more questions than we have answers on this one.
Bonus appearance from Maeby the dog, who isn't a data scientist but does like to steal food off the counter.


