Linear Digressions
Ben Jaffe and Katie Malone
Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.
Episodes
May 16, 2015 • 17min
Careers in Data Science
Let’s talk money. As a “hot” career right now, data science can pay pretty well. But for an individual person matched with a specific job or industry, how much should someone expect to make?
Katie was on the job market recently, so she has been researching this question, and it turns out that data science itself (linear regression in particular) has some answers.
In this episode, we go through a survey of hundreds of data scientists, who report on their job duties, industry, skills, education, location, etc. along with their salaries, and then talk about how this data was fed into a linear regression so that you (yes, you!) can use the patterns in the data to know what kind of salary any particular kind of data scientist might expect.
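For a feel of what "feeding data into a linear regression" means, here is a minimal sketch with a single feature. The numbers are invented for illustration and are not from the survey, which used many more features (industry, skills, location and so on); the names `fit_line` and `predict` are ours.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(x, slope, intercept):
    """Salary the fitted line predicts for a given feature value."""
    return slope * x + intercept


# Made-up data: years of experience vs. salary in thousands of dollars.
years = [0, 2, 4, 6, 8]
salary_k = [60, 70, 80, 90, 100]

slope, intercept = fit_line(years, salary_k)
estimate = predict(10, slope, intercept)
print(f"each extra year adds about {slope:.1f}k; 10 years -> {estimate:.1f}k")
```

The fitted slope is the "pattern in the data": it says how much the expected salary moves when the feature changes by one unit.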
May 14, 2015 • 3min
That's "Dr Katie" to You
Katie successfully defended her thesis! We celebrate her return, and talk a bit about what getting a PhD in Physics is like.
May 11, 2015 • 11min
Neural Nets (Part 2)
In the last episode, we zipped through neural nets and got a quick idea of how they work and why they can be so powerful. Here’s the real payoff of that work:
In this episode, we’ll talk about a brand-new pair of results, one from Stanford and one from Google, that use neural nets to perform automated picture captioning. One neural net does the object and relationship recognition of the image, a second neural net handles the natural language processing required to express that in an English sentence, and when you put them together you get an automated captioning tool. Two heads are better than one indeed...
May 1, 2015 • 9min
Neural Nets (Part 1)
There is no known learning algorithm that is more flexible and powerful than the human brain. That's quite inspirational, if you think about it--to level up machine learning, maybe we should be going back to biology and letting millions of years of evolution guide the structure of our algorithms.
This is the idea behind neural nets, which mock up the structure of the brain and are some of the most studied and powerful algorithms out there. In this episode, we’ll lay out the building blocks of the neural net (called neurons, naturally) and the networks that are built out of them.
We’ll also explore the results that neural nets get when used to do object recognition in photographs.
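As a rough sketch of the building block, a single artificial neuron takes a weighted sum of its inputs, adds a bias, and squashes the result. The sigmoid squashing function and all the weights below are assumptions chosen for illustration, not anything specified in the episode.

```python
import math


def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def layer(inputs, weight_rows, biases):
    """A layer is several neurons that all read the same inputs."""
    return [neuron(inputs, w, b) for w, b in zip(weight_rows, biases)]


# Hypothetical weights: two inputs feed a hidden layer of two neurons,
# whose outputs feed a single output neuron.
hidden = layer([1.0, 0.0], [[0.5, -0.5], [-1.0, 1.0]], [0.0, 0.0])
output = neuron(hidden, [1.0, 1.0], -1.0)
print(hidden, output)
```

A network is built by chaining layers like this, with the outputs of one layer becoming the inputs of the next.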
Apr 28, 2015 • 14min
Inferring Authorship (Part 2)
Now that we’re up to speed on the classic author ID problem (who wrote the unsigned Federalist Papers?), we move on to a couple more contemporary examples.
First, J.K. Rowling was famously outed using computational linguistics (and Twitter) when she wrote a book under the pseudonym Robert Galbraith.
Second, we’ll talk about a mystery that still endures--who is Satoshi Nakamoto? Satoshi is the mysterious person (or people) behind an extremely lucrative cryptocurrency (aka internet money) called Bitcoin; no one knows who he, she or they are, but we have plenty of writing samples in the form of whitepapers and Bitcoin forum posts. We’ll discuss some attempts to link Satoshi Nakamoto with a cryptocurrency expert and computer scientist named Nick Szabo; the links are tantalizing, but not a smoking gun. “Who is Satoshi” remains an example of attempted author identification where the threads are tangled, the conclusions inconclusive and the stakes high.
Apr 16, 2015 • 9min
Inferring Authorship (Part 1)
This episode is inspired by one of our projects for Intro to Machine Learning: given a writing sample, can you use machine learning to identify who wrote it? It turns out that the answer is yes: a person’s writing style is as distinctive as their vocal inflection or their gait when they walk.
By tracing the vocabulary used in a given piece, and comparing the word choices to the word choices in writing samples where we know the author, it can be surprisingly clear who is the more likely author of a given piece of text.
We’ll use a seminal paper from the 1960s as our example here, where the Naive Bayes algorithm was used to determine whether Alexander Hamilton or James Madison was the more likely author of a number of anonymous Federalist Papers.
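To make the word-counting idea concrete, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing and equal priors. It is a toy, not a reconstruction of the 1960s analysis: the two "authors" and their writing samples are invented, and the function names are ours.

```python
import math
from collections import Counter


def train(samples):
    """samples maps author -> list of texts; returns per-author word counts and the vocabulary."""
    counts = {
        author: Counter(word for text in texts for word in text.lower().split())
        for author, texts in samples.items()
    }
    vocab = set().union(*counts.values())
    return counts, vocab


def log_score(text, word_counts, vocab):
    """Log-likelihood of the text under one author's smoothed word frequencies."""
    total = sum(word_counts.values())
    return sum(
        math.log((word_counts[word] + 1) / (total + len(vocab)))
        for word in text.lower().split()
        if word in vocab  # words never seen in training are ignored
    )


def most_likely_author(text, counts, vocab):
    """Pick the author whose word frequencies best explain the text."""
    return max(counts, key=lambda author: log_score(text, counts[author], vocab))


# Invented writing samples for two hypothetical authors.
samples = {
    "author_a": ["upon the whole while we wait", "upon reflection the plan holds"],
    "author_b": ["whilst the plan holds we wait", "whilst on the whole we agree"],
}
counts, vocab = train(samples)
guess_1 = most_likely_author("upon the plan", counts, vocab)
guess_2 = most_likely_author("whilst we agree", counts, vocab)
print(guess_1, guess_2)
```

The "naive" part is the assumption that each word is chosen independently, which is why the score is just a sum of per-word log probabilities.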
Apr 6, 2015 • 13min
Statistical Mistakes and the Challenger Disaster
After the Challenger exploded in 1986, killing all 7 astronauts aboard, an investigation into the cause was immediately launched.
In the cold temperatures the night before the launch, the o-rings that seal the joints in the solid rocket boosters became inflexible, so they did not seal properly; hot gas escaped through a joint, which led to the fuel tank explosion. NASA knew that there could be o-ring problems, but performed the analysis of their data incorrectly and ended up massively underestimating the risk associated with the cold temperatures.
In this episode, we'll unpack the mistakes they made. We'll talk about how they excluded data points that they thought were irrelevant but which actually were critical to recognizing a fatal pattern.
Mar 25, 2015 • 15min
Genetics and Um Detection (HMM Part 2)
In part two of our series on Hidden Markov Models (HMMs), we talk to Katie and special guest Francesco about more useful and novel applications of HMMs. We revisit Katie's "Um Detector," and hear about how HMMs are used in genetics research.
Mar 24, 2015 • 15min
Introducing Hidden Markov Models (HMM Part 1)
Wikipedia says, "A hidden Markov model (HMM) is a statistical Markov model in which the system being modeled is assumed to be a Markov process with unobserved (hidden) states." What does that even mean?
In part one of a special two-parter on HMMs, Katie, Ben, and special guest Francesco explain the basics of HMMs, and some simple applications of them in the real world. This episode sets the stage for part two, where we explore the use of HMMs in modern genetics, and possibly Katie's "Um Detector."
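One way to unpack "hidden states" is a toy decoder: we only observe something noisy, and we ask which sequence of hidden states most plausibly produced it. The sketch below uses the Viterbi algorithm on a two-state model loosely themed on an "um" detector. The states, observations and every probability are made up for illustration and have nothing to do with how Katie's actual detector works.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable hidden-state sequence for the observations."""
    # best[s] = (probability of the best path ending in state s, that path)
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Choose the best previous state to arrive from.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


# Hypothetical model: the speaker is either saying a word or saying "um";
# all we observe is whether the pitch in each time slice is varied or flat.
states = ["word", "um"]
start_p = {"word": 0.8, "um": 0.2}
trans_p = {"word": {"word": 0.8, "um": 0.2}, "um": {"word": 0.5, "um": 0.5}}
emit_p = {"word": {"varied": 0.9, "flat": 0.1}, "um": {"varied": 0.2, "flat": 0.8}}

decoded = viterbi(["varied", "flat", "flat", "varied"], states, start_p, trans_p, emit_p)
print(decoded)
```

The "Markov" part is that each hidden state depends only on the one before it; the "hidden" part is that we never see the states directly, only the observations they emit.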
Mar 12, 2015 • 8min
Monte Carlo For Physicists
This is another physics-centered podcast, about an ML-backed particle identification tool that we use to figure out what kind of particle caused a particular blob in the detector. But in this case, as in many cases, it looks hard at the outset to use ML because we don't have labeled training data. Monte Carlo to the rescue!
Monte Carlo (MC) is fake data that we generate for ourselves, usually following certain sets of rules (often a Markov chain; in physics we generate MC according to the laws of physics as we understand them). Since we generated each event ourselves, we "know" what the correct label is.
Of course, it's a lot of work to validate your MC, but the payoff is that then you can use Machine Learning where you never could before.
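Here is a minimal sketch of the idea: simulate events whose true label we know, then use them as training data. Everything in it is an assumption for illustration, including the two particle types, the Gaussian "blob width" with made-up means, and the simple threshold classifier standing in for a real ML model.

```python
import random


def generate_mc(n_events, seed=0):
    """Simulate (blob_width, label) pairs; the label is known because we chose it."""
    rng = random.Random(seed)
    blob_mean = {"electron": 2.0, "pion": 5.0}  # made-up detector response
    events = []
    for _ in range(n_events):
        label = rng.choice(["electron", "pion"])
        blob_width = rng.gauss(blob_mean[label], 1.0)
        events.append((blob_width, label))
    return events


def fit_threshold(events):
    """'Train' by placing a cut halfway between the two class means."""
    means = {}
    for label in ("electron", "pion"):
        widths = [w for w, l in events if l == label]
        means[label] = sum(widths) / len(widths)
    return (means["electron"] + means["pion"]) / 2


def classify(blob_width, threshold):
    """Narrow blobs are called electrons, wide blobs pions."""
    return "electron" if blob_width < threshold else "pion"


train_events = generate_mc(2000, seed=0)
threshold = fit_threshold(train_events)

# Evaluate on a separate simulated sample.
test_events = generate_mc(2000, seed=1)
correct = sum(classify(w, threshold) == label for w, label in test_events)
accuracy = correct / len(test_events)
print(f"threshold={threshold:.2f}, accuracy={accuracy:.2%}")
```

The classifier is only as good as the simulation: if real blobs do not look like the simulated ones, the labels are "correct" but the model learns the wrong thing, which is why validating the MC matters so much.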


