
Data Skeptic
The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the veracity of claims and efficacy of approaches.
Latest episodes

Aug 28, 2015 • 53min
ContentMine
ContentMine is a project that provides the tools and workflow to convert scientific literature into machine-readable and machine-interpretable data, in order to facilitate better and more effective access to the accumulated knowledge of humankind. The project's founder, Peter Murray-Rust, joins us this week to discuss ContentMine. Our discussion covers the project, the scientific publication process, copyright, and several other interesting topics.

Aug 21, 2015 • 13min
[MINI] Structured and Unstructured Data
Today's mini-episode explains the distinction between structured and unstructured data, and debates which of these categories best describes recipes.

Aug 14, 2015 • 25min
Measuring the Influence of Fashion Designers
Yusan Lin shares her research on using data science to explore the fashion industry in this episode. She has applied techniques from data mining, natural language processing, and social network analysis to explore who the innovators in the fashion world are and how their influence affects other designers. If you found this episode interesting and would like to read more, Yusan's papers "Text-Generated Fashion Influence Model: An Empirical Study on Style.com" and "The Hidden Influence Network in the Fashion Industry" are worth reading.

Aug 7, 2015 • 8min
[MINI] PageRank
PageRank is the algorithm most famous for being one of the original innovations that made Google stand out as a search engine. It was defined in the classic paper "The Anatomy of a Large-Scale Hypertextual Web Search Engine" by Sergey Brin and Larry Page. While this algorithm clearly impacted web searching, it has also been useful in a variety of other applications. This episode presents a high-level description of the algorithm and how it might apply when trying to establish who writes the most influential academic papers.
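
For readers who want to see the idea in code, here is a minimal sketch of PageRank by power iteration on a tiny citation graph. It is not taken from the episode: the paper names, the damping factor of 0.85, the fixed iteration count, and the choice to normalize ranks so they sum to 1 are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict mapping node -> list of cited nodes.

    Assumption: ranks are normalized to sum to 1, and rank held by nodes
    with no outgoing links is redistributed evenly across all nodes.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Total rank currently sitting on nodes that cite nothing.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        # Every node starts with the "random jump" share plus its dangling share.
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        # Each node passes its rank, split evenly, to the nodes it cites.
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical citation graph: each paper lists the papers it cites.
citations = {
    "paper_a": ["paper_c"],
    "paper_b": ["paper_c"],
    "paper_c": ["paper_d"],
    "paper_d": [],
}
scores = pagerank(citations)
for paper, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(paper, round(score, 3))
```

In this toy graph, paper_d ends up ranked above paper_c even though it has only one citation, because that citation comes from a well-cited paper. That is the core intuition: a citation counts for more when it comes from an influential source.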

Jul 29, 2015 • 41min
Data Science at Work in LA County
In this episode, Benjamin Uminsky enlightens us about some of the ways the Los Angeles County Registrar-Recorder/County Clerk leverages data science and analysis to deliver its services to citizens more effectively and efficiently. Our topics range from forecasting to predicting the likelihood that people will volunteer to be poll workers. Benjamin recently spoke at Big Data Day LA. Videos have not yet been posted, but you can see the slides from his talk "Data Mining Forecasting and BI at the RRCC" if this episode has left you hungry to learn more. During the show, Benjamin encouraged any Los Angeles residents who have some time to serve their community to consider becoming a poll worker.

Jul 24, 2015 • 9min
[MINI] k-Nearest Neighbors
This episode explores the k-nearest neighbors algorithm, which is a supervised, non-parametric method that can be used for both classification and regression. The basic concept is that it uses some distance function on your dataset to find the $k$ closest other observations and then averages them (or takes a majority vote among their labels) to impute an unknown value or label an unlabelled datapoint.
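
The description above can be sketched in a few lines of Python. This is an illustrative sketch rather than anything presented in the episode: the function names, the toy data, and the choice of Euclidean distance are assumptions, and the training set is assumed to be non-empty.

```python
import math
from collections import Counter


def k_nearest(train, query, k):
    """Return the k training items closest to `query` by Euclidean distance."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest labelled points."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]


def knn_regress(train, query, k=3):
    """Estimate a numeric value for `query` as the mean of its k nearest values."""
    neighbors = k_nearest(train, query, k)
    return sum(value for _, value in neighbors) / len(neighbors)


# Hypothetical labelled points: (features, label).
labelled = [
    ((1.0, 1.0), "red"),
    ((1.2, 0.8), "red"),
    ((0.9, 1.1), "red"),
    ((5.0, 5.0), "blue"),
    ((5.2, 4.9), "blue"),
    ((4.8, 5.1), "blue"),
]
# Hypothetical numeric targets: (features, value).
valued = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

print(knn_classify(labelled, (1.1, 1.0)))
print(knn_regress(valued, (2.0,)))
```

Sorting the whole training set for every query keeps the sketch short but scales poorly. Larger datasets usually rely on a spatial index or an approximate nearest-neighbor search instead.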

Jul 17, 2015 • 1h 25min
Crypto
How do people think rationally about small-probability events? What is the optimal statistical process by which one can update one's beliefs in light of new evidence? This episode of Data Skeptic explores questions like these as Kyle consults a cast of previous guests and experts to try to answer the question "What is the probability, however small, that Bigfoot is real?"
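
One standard way to formalize updating a belief in light of new evidence is Bayes' rule. The sketch below is an assumption about how such an update could be computed, not a summary of the episode's method, and every number in it (the prior and both likelihoods) is made up purely for illustration.

```python
def bayes_update(prior, p_evidence_if_true, p_evidence_if_false):
    """Return P(hypothesis | evidence) via Bayes' rule.

    prior: P(hypothesis) before seeing the evidence.
    p_evidence_if_true: P(evidence | hypothesis is true).
    p_evidence_if_false: P(evidence | hypothesis is false).
    """
    numerator = p_evidence_if_true * prior
    denominator = numerator + p_evidence_if_false * (1.0 - prior)
    return numerator / denominator


# Hypothetical numbers: a one-in-a-million prior, and a blurry photo that is
# nine times more likely to exist if the hypothesis is true than if it is false.
prior = 1e-6
posterior = bayes_update(prior, p_evidence_if_true=0.9, p_evidence_if_false=0.1)
print(posterior)
```

With these made-up inputs the posterior grows by roughly the likelihood ratio yet stays tiny, which shows why weak evidence does little to rescue a hypothesis that starts with a very small prior.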

Jul 10, 2015 • 13min
[MINI] MapReduce
This mini-episode is a high-level explanation of the basic idea behind MapReduce, which is a fundamental concept in big data. The idea originates in a Google paper titled "MapReduce: Simplified Data Processing on Large Clusters". This episode makes an analogy to tabulating paper voting ballots as a means of helping to explain how and why MapReduce is an important concept.
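
The ballot-counting analogy maps naturally onto the three MapReduce phases. Below is a single-process sketch of that idea. The function names, the precinct data, and the candidate names are hypothetical, and a real MapReduce system would run the map and reduce steps in parallel across many machines rather than in one loop.

```python
from collections import defaultdict


def map_ballot(ballot):
    """Map phase: emit a (candidate, 1) pair for a single ballot."""
    yield (ballot, 1)


def shuffle(pairs):
    """Shuffle phase: group all emitted values by their key (the candidate)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_votes(candidate, counts):
    """Reduce phase: collapse one candidate's list of 1s into a total."""
    return candidate, sum(counts)


def tally(precincts):
    """Run map, shuffle, and reduce over ballots from every precinct."""
    mapped = [
        pair
        for precinct in precincts
        for ballot in precinct
        for pair in map_ballot(ballot)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_votes(candidate, counts) for candidate, counts in grouped.items())


# Hypothetical ballots, one inner list per precinct.
precincts = [["alice", "bob"], ["alice"], ["bob", "alice", "carol"]]
print(tally(precincts))
```

Because each ballot is mapped independently and each candidate is reduced independently, both phases can be split across workers without coordination. Only the shuffle step needs to move data between them.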

Jul 3, 2015 • 35min
Genetically Engineered Food and Trends in Herbicide Usage
The Credible Hulk joins me in this episode to discuss a recent blog post he wrote about glyphosate and the data about how its introduction changed the historical usage trends of other herbicides. Links to all the sources and references can be found in the blog post. In this discussion, we also mention the Food Babe and Last Thursdayism, which may be worth some further reading. Kyle also mentioned the list of ingredients or chemical composition of a banana. Credible Hulk mentioned the Mommy PhD Facebook page. An interesting article about Mommy PhD can be found here. Lastly, if you enjoyed the show, please "Like" the Credible Hulk Facebook group.

Jun 26, 2015 • 11min
[MINI] The Curse of Dimensionality
This podcast explores the curse of dimensionality in machine learning, using the examples of gas station selection and buying a home. It discusses the challenges of high-dimensional data and the use of dimensionality reduction. The hosts also share their personal preferences in home buying.
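
One common way to make the curse of dimensionality concrete is to ask how wide a "neighborhood" must be to capture a fixed share of the data. The sketch below is not from the episode: it assumes data spread uniformly over a unit hypercube, and the function name is hypothetical. Under that assumption, a sub-cube holding a given fraction of the points needs an edge length of fraction ** (1 / dimensions).

```python
def edge_length_for_fraction(fraction, dimensions):
    """Edge length of a sub-cube holding `fraction` of uniformly spread data.

    Assumes the data fill a unit hypercube uniformly, so the sub-cube's
    volume (edge ** dimensions) must equal `fraction`.
    """
    return fraction ** (1.0 / dimensions)


# How much of each feature's range is needed to capture 10% of the data?
for dimensions in (1, 2, 10, 100):
    edge = edge_length_for_fraction(0.10, dimensions)
    print(dimensions, round(edge, 3))
```

As the number of features grows, the edge length approaches 1, meaning a "local" neighborhood has to span almost the full range of every feature. Under this assumption, comparing homes on many criteria at once leaves very few that are genuinely similar to one another.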