Linear Digressions

Ben Jaffe and Katie Malone
Jul 3, 2016 • 19min

Reinforcement Learning for Artificial Intelligence

There’s a ton of excitement about reinforcement learning, a branch of machine learning that underpins a lot of today’s cutting-edge artificial intelligence algorithms. Here’s a crash course in the algorithmic machinery behind AlphaGo, and self-driving cars, and major logistical optimization projects—and the robots that, tomorrow, will clean our houses and (hopefully) not take over the world…
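For a concrete taste of that machinery, here's a minimal sketch of tabular Q-learning on a made-up five-state chain world. The episode is a general overview, so this particular algorithm, environment, and all the parameter values below are illustrative assumptions, not something taken from the show.

```python
import random

n_states, n_actions = 5, 2            # tiny chain world: action 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1 # learning rate, discount, exploration rate
Q = [[0.0] * n_actions for _ in range(n_states)]

def step(state, action):
    """Hypothetical environment: reaching the rightmost state pays 1, everything else 0."""
    nxt = max(0, min(n_states - 1, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == n_states - 1 else 0.0)

def greedy(qvals):
    best = max(qvals)
    return random.choice([a for a, q in enumerate(qvals) if q == best])  # break ties randomly

for _ in range(500):                   # 500 training episodes
    s = 0
    for _ in range(100):               # cap episode length
        a = random.randrange(n_actions) if random.random() < epsilon else greedy(Q[s])
        s2, r = step(s, a)
        # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a')
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2
        if s == n_states - 1:
            break

print(Q)   # "go right" should end up with the higher value in every state
```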
Jun 27, 2016 • 18min

Differential Privacy: how to study people without being weird and gross

Apple wants to study iPhone users' activities and use that data to improve performance. Google collects data on what people are doing online to try to improve their Chrome browser. Do you like the idea of this data being collected? Maybe not, if it's being collected on you--but you probably also realize that there is some benefit to be had from the improved iPhones and web browsers. Differential privacy is a set of techniques that walks the line between individual privacy and better data, including even some old-school tricks that scientists use to get people to answer embarrassing questions honestly. Relevant links: http://static.googleusercontent.com/media/research.google.com/en//pubs/archive/42852.pdf
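The old-school trick alluded to here is most likely randomized response: each respondent flips a coin, so any single answer is deniable, but the population-level rate can still be recovered. Here's a minimal simulation sketch; the true rate, probabilities, and sample size are made up for illustration.

```python
import random

def randomized_answer(true_answer: bool, p_truth: float = 0.5) -> bool:
    """With probability p_truth report the truth; otherwise report a fresh coin flip."""
    if random.random() < p_truth:
        return true_answer
    return random.random() < 0.5

def estimate_true_rate(answers, p_truth: float = 0.5) -> float:
    """Invert the noise: observed_rate = p_truth * true_rate + (1 - p_truth) * 0.5."""
    observed = sum(answers) / len(answers)
    return (observed - (1 - p_truth) * 0.5) / p_truth

# Simulate 100,000 respondents where the true "yes" rate is 30%.
truths = [random.random() < 0.3 for _ in range(100_000)]
reported = [randomized_answer(t) for t in truths]
print(estimate_true_rate(reported))   # should land near 0.30, without trusting any one answer
```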
Jun 20, 2016 • 29min

How the sausage gets made

Something a little different in this episode--we'll be talking about the technical plumbing that gets our podcast from our brains to your ears. As it turns out, it's a multi-step bucket brigade process of RSS feeds, links to downloads, and lots of hand-waving when it comes to trying to figure out how many of you (listeners) are out there.
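As a rough illustration of the first hop in that bucket brigade, here's a sketch of pulling episode titles and audio URLs out of a podcast RSS feed; the toy feed XML below is invented, not this show's real feed.

```python
import xml.etree.ElementTree as ET

# A toy RSS document standing in for a real podcast feed (real feeds are fetched
# over HTTP, e.g. with urllib, and then parsed the same way).
FEED_XML = """
<rss version="2.0">
  <channel>
    <title>Some Podcast</title>
    <item>
      <title>Episode 42</title>
      <enclosure url="https://example.com/ep42.mp3" type="audio/mpeg" length="12345678"/>
    </item>
  </channel>
</rss>
"""

root = ET.fromstring(FEED_XML)
for item in root.iter("item"):
    title = item.findtext("title")
    enclosure = item.find("enclosure")    # podcast apps download the mp3 from this URL
    audio_url = enclosure.get("url") if enclosure is not None else None
    print(title, "->", audio_url)
```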
Jun 13, 2016 • 15min

SMOTE: makin' yourself some fake minority data

Machine learning on imbalanced classes: surprisingly tricky. Many (most?) algorithms tend to just assign the majority class label to all the data and call it a day. SMOTE is an algorithm for manufacturing new minority class examples for yourself, to help your algorithm better identify them in the wild. Relevant links: https://www.jair.org/media/953/live-953-2037-jair.pdf
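Here's a minimal sketch of the core SMOTE move from the linked Chawla et al. paper: pick a minority-class point, pick one of its nearest minority-class neighbors, and synthesize a new point somewhere along the segment between them. The data below are random, just to show the shape of the operation.

```python
import numpy as np

def smote(minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Manufacture n_new synthetic minority-class examples by interpolation."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        dists = np.linalg.norm(minority - x, axis=1)   # distances to every minority point
        neighbors = np.argsort(dists)[1:k + 1]         # k nearest, skipping x itself
        neighbor = minority[rng.choice(neighbors)]
        gap = rng.random()                             # how far along the segment to go
        synthetic.append(x + gap * (neighbor - x))
    return np.array(synthetic)

minority_points = np.random.default_rng(1).normal(size=(20, 2))
print(smote(minority_points, n_new=5).shape)           # (5, 2): five new "fake" examples
```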
Jun 6, 2016 • 18min

Conjoint Analysis: like AB testing, but on steroids

Conjoint analysis is like AB testing, but bigger and better: instead of testing one or two things, you can test potentially dozens of options. Where might you use something like this? Well, say you wanted to design an entire hotel chain completely from scratch, and to do it in a data-driven way. You'll never look at Courtyard by Marriott the same way again. Relevant link: https://marketing.wharton.upenn.edu/files/?whdmsaction=public:main.file&fileID=466
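As a rough sketch of what a ratings-based conjoint analysis can look like, here's a tiny example that regresses made-up hotel-profile ratings on dummy-coded attributes to recover per-attribute "part-worth" utilities. The attributes, profiles, and scores are invented for illustration, not taken from the linked paper.

```python
import numpy as np

# Each profile: [has_pool, free_breakfast, price_tier_high]  (dummy-coded attributes)
profiles = np.array([
    [1, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
])
ratings = np.array([6.0, 7.5, 6.5, 2.0, 9.0, 4.0])     # hypothetical respondent scores

# Ordinary least squares with an intercept column gives the part-worths.
X = np.column_stack([np.ones(len(profiles)), profiles])
partworths, *_ = np.linalg.lstsq(X, ratings, rcond=None)
for name, w in zip(["baseline", "pool", "breakfast", "high price"], partworths):
    print(f"{name}: {w:+.2f}")   # positive = adds utility, negative = hurts
```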
May 30, 2016 • 18min

Traffic Metering Algorithms

This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and clog up. If you're someone who listens to podcasts while commuting, and especially if your area has on-ramp metering, you'll never look at highway access control the same way again (yeah, we know this is super nerdy; it's also super awesome). Relevant links: http://its.berkeley.edu/sites/default/files/publications/UCB/99/PWP/UCB-ITS-PWP-99-19.pdf http://www.its.uci.edu/~lchu/ramp/Final_report_mou3013.pdf
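One widely cited local feedback metering rule is ALINEA, which each cycle nudges the on-ramp release rate up or down based on how far the measured downstream occupancy is from a target. The sketch below is purely illustrative; the linked reports survey several strategies, and the gain, target, and occupancy readings here are made up.

```python
def alinea(prev_rate, measured_occupancy, target_occupancy=0.18, gain=70.0,
           min_rate=200.0, max_rate=1800.0):
    """Return the new metering rate in vehicles/hour (ALINEA-style feedback)."""
    rate = prev_rate + gain * (target_occupancy - measured_occupancy)
    return max(min_rate, min(max_rate, rate))    # keep the rate physically sensible

# Hypothetical downstream occupancy readings (fraction of time a loop detector is
# covered by a car); higher than target means the freeway is filling up.
occupancies = [0.12, 0.16, 0.20, 0.25, 0.22, 0.17]
rate = 900.0
for o in occupancies:
    rate = alinea(rate, o)
    print(f"occupancy={o:.2f} -> meter at {rate:.0f} veh/hr")
```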
May 23, 2016 • 14min

Um Detector 2: The Dynamic Time Warp

One tricky thing about working with time series data, like the audio data in our "um" detector (remember that? because we barely do...), is that sometimes events look really similar but one is a little bit stretched and squeezed relative to the other. Besides having an amazing name, the dynamic time warp is a handy algorithm for aligning two time series sequences that are close in shape, but don't quite line up out of the box. Relevant link: http://www.aaai.org/Papers/Workshops/1994/WS-94-03/WS94-03-031.pdf
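Here's a minimal sketch of the dynamic time warp itself: a dynamic program that finds the cheapest way to align two sequences, allowing steps that stretch or squeeze one series relative to the other. The toy series below are our own, not the podcast's audio features.

```python
def dtw_distance(a, b):
    """Classic O(n*m) dynamic time warping distance between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # allow a match, an insertion, or a deletion on the warping path
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

# The second series is the first one stretched in time; DTW still scores them as close.
x = [0, 1, 2, 3, 2, 1, 0]
y = [0, 0, 1, 1, 2, 3, 3, 2, 1, 0]
print(dtw_distance(x, y))    # small distance despite the different lengths
```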
May 16, 2016 • 30min

Inside a Data Analysis: Fraud Hunting at Enron

It's storytime this week--the story, from beginning to end, of how Katie designed and built the main project for Udacity's Intro to Machine Learning class, when she was developing the course. The project was to use email and financial data to hunt for signatures of fraud at Enron, one of the biggest cases of corporate fraud in history. That description makes the project sound pretty clean, but getting the data into the right shape, and even doing some dataset merging (that hadn't ever been done before), made this project much more interesting to design than it might appear. Here's the story of what a data analysis like this looks like...from the inside.
May 9, 2016 • 26min

What's the biggest #bigdata?

Data science is often mentioned in the same breath as big data. But how big is big data? And who has the biggest big data? CERN? YouTube? ... Something (or someone) else? Relevant link: http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002195
May 2, 2016 • 21min

Data Contamination

Supervised machine learning assumes that the features and labels used for building a classifier are isolated from each other--basically, that you can't cheat by peeking. Turns out this can be easier said than done. In this episode, we'll talk about the many (and diverse!) cases where label information contaminates features, ruining data science competitions along the way. Relevant links: https://www.researchgate.net/profile/Claudia_Perlich/publication/221653692_Leakage_in_data_mining_Formulation_detection_and_avoidance/links/54418bb80cf2a6a049a5a0ca.pdf
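Here's a deliberately artificial illustration of that kind of leakage: a feature column built using the label (the label plus a little noise) makes cross-validated accuracy look nearly perfect, even though the honest features carry no signal at all. The data are random and the setup is ours, not an example from the episode or the linked paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)                             # binary labels

honest_features = rng.normal(size=(1000, 5))                  # unrelated to the label
leaky_column = (y + rng.normal(scale=0.1, size=1000)).reshape(-1, 1)
leaky_features = np.hstack([honest_features, leaky_column])   # label info has "leaked" in

model = LogisticRegression(max_iter=1000)
print("honest:", cross_val_score(model, honest_features, y).mean())   # ~0.5, chance level
print("leaky: ", cross_val_score(model, leaky_features, y).mean())    # ~1.0, too good to be true
```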
