Linear Digressions
Ben Jaffe and Katie Malone
Linear Digressions is a podcast about machine learning and data science. Machine learning is being used to solve a ton of interesting problems, and to accomplish goals that were out of reach even a few short years ago.
Episodes
Mentioned books
Nov 4, 2018 • 20min
Optimized Optimized Web Crawling
Last week’s episode, about methods for optimized web crawling logic, left off on a bit of a cliffhanger: the data scientists had found a solution to the problem, but it wasn’t something that the engineers (who own the search codebase, remember) liked very much. It was black-boxy, hard to parallelize, and introduced a lot of complexity to their code. This episode takes a second crack, where we formulate the problem a little differently and end up with a different, arguably more elegant solution.
Relevant links:
http://www.unofficialgoogledatascience.com/2018/07/by-bill-richoux-critical-decisions-are.html
http://www.csc.kth.se/utbildning/kth/kurser/DD3364/Lectures/KKT.pdf
Oct 28, 2018 • 22min
Optimized Web Crawling
Got a fun optimization problem for you this week! It’s a two-for-one: how do you optimize the web crawling logic of an operation like Google search so that the results are, on average, as up-to-date as possible, and how do you optimize your solution of choice so that it’s maintainable by software engineers in a huge distributed system? We’re following an excellent post from the Unofficial Google Data Science blog going through this problem.
Relevant links: http://www.unofficialgoogledatascience.com/2018/07/by-bill-richoux-critical-decisions-are.html
Oct 22, 2018 • 32min
Better Know a Distribution: The Poisson Distribution
The Poisson distribution is a probability distribution function used to for events that happen in time or space. It’s super handy because it’s pretty simple to use and is applicable for tons of things—there are a lot of interesting processes that boil down to “events that happen in time or space.” This episode is a quick introduction to the distribution, and then a focus on two of our favorite applications: using the Poisson distribution to identify supernovas and study army deaths from horse kicks.
Oct 15, 2018 • 20min
Searching for Datasets with Google
If you wanted to find a dataset of jokes, how would you do it? What about a dataset of podcast episodes? If your answer was “I’d try Google,” you might have been disappointed—Google is a great search engine for many types of web data, but it didn’t have any special tools to navigate the particular challenges of, well, dataset data. But all that is different now: Google recently announced Google Dataset Search, an effort to unify metadata tagging around datasets and complementary efforts on the search side to recognize and organize datasets in a way that’s useful and intuitive. So whether you’re an academic looking for an economics or physics or biology dataset, or a big old nerd modeling jokes or analyzing podcasts, there’s an exciting new way for you to find data.
Oct 8, 2018 • 22min
It's our fourth birthday
We started Linear Digressions 4 years ago… this isn’t a technical episode, just two buddies shooting the breeze about something we’ve somehow built together.
Sep 30, 2018 • 25min
Gigantic Searches in Particle Physics
This week, we’re dusting off the ol’ particle physics PhD to bring you an episode about ambitious new model-agnostic searches for new particles happening at CERN. Traditionally, new particles have been discovered by “targeted searches,” where scientists have a hypothesis about the particle they’re looking for and where it might be found. However, with the huge amounts of data coming out of CERN, a new type of broader search algorithm is starting to be deployed. It’s a strategy that casts a very wide net, looking in many different places at the same time, which also introduces all kinds of interesting questions—even a one-in-a-thousand occurrence happens when you’re looking in many thousands of places.
Sep 24, 2018 • 16min
Data Engineering
If you’re a data scientist, you know how important it is to keep your data orderly, clean, moving smoothly between different systems, well-documented… there’s a ton of work that goes into building and maintaining databases and data pipelines. This job, that of owner and maintainer of the data being used for analytics, is often the realm of data engineers. From data extraction, transform and loading procedures to the data storage strategy and even the definitions of key data quantities that serve as focal points for a whole organization, data engineers keep the plumbing of data analytics running smoothly.
Sep 16, 2018 • 19min
Text Analysis for Guessing the NYTimes Op-Ed Author
A very intriguing op-ed was published in the NY Times recently, in which the author (a senior official in the Trump White House) claimed to be a minor saboteur of sorts, acting with his or her colleagues to undermine some of Donald Trump’s worst instincts and tendencies. Pretty stunning, right? So who is the author? It’s a mystery—the op-ed was published anonymously. That hasn’t stopped people from speculating though, and some machine learning on the vocabulary used in the op-ed is one way to get clues.
Sep 9, 2018 • 23min
The Three Types of Data Scientists, and What They Actually Do
If you've been in data science for more than a year or two, chances are you've noticed changes in the field as it's grown and matured. And if you're newer to the field, you may feel like there's a disconnect between lots of different stories about what data scientists should know, or do, or expect from their job. This week, we cover two thought pieces, one that arose from interviews with 35(!) data scientists speaking about what their jobs actually are (and aren't), and one from the head of data science at AirBnb organizing core data science work into three main specialties.
Relevant links:
https://hbr.org/2018/08/what-data-scientists-really-do-according-to-35-data-scientists
https://www.linkedin.com/pulse/one-data-science-job-doesnt-fit-all-elena-grewal
7 snips
Aug 26, 2018 • 27min
Agile Development for Data Scientists, Part 2: Where Modifications Help
Delve into the fascinating blend of agile development and data science! Discover how collaborative practices can significantly enhance project success. Uncover the dangers of rushing projects and how simplicity can lead to better outcomes. Learn about the unique challenges in data science that require flexible management. The importance of peer reviews and weekly wrap-ups shines through as essential steps for improved communication and quality. Transform your approach to data analytics with modified agile techniques!


