
Data Skeptic
The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the veracity of claims and efficacy of approaches.
Latest episodes

Apr 24, 2015 • 16min
[MINI] Cornbread and Overdispersion
For our 50th episode we indulge a bit by cooking Linhda's previously mentioned "healthy" cornbread. This leads to a discussion of the statistical topic of overdispersion, in which the variance of some distribution is larger than what one's underlying model will account for.
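
One rough way to check for overdispersion in count data is the variance-to-mean ratio, since a Poisson model implies that the variance equals the mean. The sketch below is an illustration only: the function name `dispersion_index` and the sample counts are invented here and do not come from the episode.

```python
from statistics import mean, pvariance


def dispersion_index(counts):
    """Return the variance-to-mean ratio of a list of counts.

    A Poisson model implies a ratio near 1; a ratio well above 1
    suggests more variance than that model accounts for.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean is zero; the ratio is undefined")
    return pvariance(counts) / m


# Hypothetical daily counts, invented for illustration.
counts = [0, 0, 1, 0, 9, 0, 12, 1, 0, 7]
print(f"dispersion index: {dispersion_index(counts):.2f}")
```

A ratio far above 1 is only a hint; a formal test or a model with an extra dispersion parameter (for example, a negative binomial) would be the usual next step.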

Apr 17, 2015 • 13min
[MINI] Natural Language Processing
This podcast explores the concepts and techniques of natural language processing, including stemming, n-grams, part of speech tagging, and the bag of words approach. It discusses the challenges and applications of training computers to understand and recognize words in sentences and emphasizes the importance of word context and sequences in extracting meaning. The limitations of the 'bag of words' approach are highlighted, and examples are given to demonstrate how word frequency counts can be used to detect similarities between books.
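
The limitation of the bag of words approach can be seen in a few lines: two sentences with opposite meanings produce identical word counts, while their bigrams differ. This is a minimal sketch using only the standard library; the helper names (`tokenize`, `bag_of_words`, `ngrams`, `cosine_similarity`) and the example sentences are assumptions made for illustration, not material from the episode.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(text):
    """Count how often each word occurs, ignoring word order."""
    return Counter(tokenize(text))


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def cosine_similarity(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


s1 = "the dog bit the man"
s2 = "the man bit the dog"

# Same bag of words, so word counts alone cannot tell these apart.
print(bag_of_words(s1) == bag_of_words(s2))
# The bigrams keep some word order, so they differ.
print(ngrams(tokenize(s1), 2) == ngrams(tokenize(s2), 2))
```

The same `cosine_similarity` function could be applied to word counts from two longer texts to get a crude similarity score between them.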

Apr 10, 2015 • 32min
Computer-based Personality Judgments
Guest Youyou Wu discusses the work she and her collaborators did to measure the accuracy of computer-based personality judgments. Using Facebook "like" data, they found that machine learning approaches could be used to estimate users' self-assessments of the "big five" personality traits: openness, agreeableness, extraversion, conscientiousness, and neuroticism. Interestingly, the computer-based assessments outperformed the assessments made by certain groups of human beings. Listen to the episode to learn more. The original paper, "Computer-based personality judgements are more accurate than those made by humans", appeared in the January 2015 volume of the Proceedings of the National Academy of Sciences (PNAS). For her benevolent recommendation, Youyou suggests "Private traits and attributes are predictable from digital records of human behavior" by Michal Kosinski, David Stillwell, and Thore Graepel. It's a similar paper by her co-authors which looks at demographic traits rather than personality traits. And for her self-serving recommendation, Youyou has a link that I'm very excited about. You can visit ApplyMagicSauce.com to see how this model evaluates your personality based on your Facebook like information. I'd love it if listeners participated in this research and shared their perspective on the results via The Data Skeptic Podcast Facebook page. I'm going to be posting mine there for everyone to see.

Apr 3, 2015 • 16min
[MINI] Markov Chain Monte Carlo
Explore how Markov Chain Monte Carlo (MCMC) algorithms can be used to model complex systems and estimate movement probabilities. Learn how MCMC applies to winery popularity and to understanding the likelihood of visiting particular wineries. Discover real-life applications of MCMC in determining probability distributions, advertising placement, and popular routes.
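
As a rough illustration of the idea, the sketch below uses the Metropolis algorithm, one member of the MCMC family, to wander between three wineries so that the share of time spent at each approaches its relative popularity. The winery names, the popularity weights, and the function `metropolis_visits` are all invented for this example; the episode may describe the process differently.

```python
import random
from collections import Counter


def metropolis_visits(weights, steps, seed=0):
    """Estimate visit frequencies proportional to positive weights.

    Uses a Metropolis sampler with a uniform (symmetric) proposal:
    a move to a more popular state is always accepted, and a move to
    a less popular one is accepted with probability equal to the
    ratio of the two weights.
    """
    rng = random.Random(seed)
    states = list(weights)
    current = states[0]
    visits = Counter()
    for _ in range(steps):
        proposal = rng.choice(states)
        accept_prob = min(1.0, weights[proposal] / weights[current])
        if rng.random() < accept_prob:
            current = proposal
        visits[current] += 1
    return {s: visits[s] / steps for s in states}


# Hypothetical popularity scores, invented for illustration.
popularity = {"Winery A": 1.0, "Winery B": 2.0, "Winery C": 3.0}
for name, share in metropolis_visits(popularity, steps=60000).items():
    print(f"{name}: {share:.3f}")
```

Only ratios of weights are used, which is why this family of methods is handy when the normalizing constant of a distribution is unknown or expensive to compute.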

Mar 20, 2015 • 11min
[MINI] Markov Chains
This podcast discusses Markov Chains and their applications in various systems including stop lights, text prediction, and bowling. The hosts explore the concept of Markov Chains in daily life and technology, as well as their impact on partially observable state spaces.
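
A stop light makes a compact Markov chain example: the next color depends only on the current one. The sketch below assumes a made-up transition table (the probabilities are not from the episode) and computes the distribution over colors after a number of steps.

```python
# Hypothetical per-step transition probabilities for a stop light.
# Each row sums to 1; the numbers are invented for illustration.
TRANSITIONS = {
    "green": {"green": 0.8, "yellow": 0.2},
    "yellow": {"yellow": 0.5, "red": 0.5},
    "red": {"red": 0.7, "green": 0.3},
}


def step(dist):
    """Advance a probability distribution over states by one step."""
    nxt = {state: 0.0 for state in TRANSITIONS}
    for state, p in dist.items():
        for target, q in TRANSITIONS[state].items():
            nxt[target] += p * q
    return nxt


def distribution_after(start, n):
    """Distribution over states after n steps, starting in `start`."""
    dist = {start: 1.0}
    for _ in range(n):
        dist = step(dist)
    return dist


for color, p in distribution_after("green", 50).items():
    print(f"{color}: {p:.3f}")
```

After enough steps the printed distribution stops depending on the starting color, which is the chain settling toward its stationary distribution.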

Mar 13, 2015 • 33min
Oceanography and Data Science
Nicole Goebel joins us this week to share her experiences in oceanography studying phytoplankton and other aspects of the ocean, and how data plays a role in that science. We also discuss Thinkful, where Nicole and I are both mentors for the Introduction to Data Science course. Last but not least, check out Nicole's blog Data Science Girl and the videos Kyle mentioned on her YouTube channel, including one on the diversity of phytoplankton and how that changes in time and space.

Mar 6, 2015 • 18min
[MINI] Ordinary Least Squares Regression
The podcast explores Ordinary Least Squares regression, discussing the concept of regression and fitting models, making a YouTube video for a healthy cornbread recipe and discussing an ice cream recipe, controlling variables in an ice cream experiment, and exploring linear relationships in regression analysis.
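
For a single predictor, ordinary least squares has a closed form: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the point of means. The sketch below assumes a hypothetical helper `ols_fit` and invented data points; it is not taken from the episode.

```python
def ols_fit(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Minimizes the sum of squared vertical distances between the
    points and the line, using the closed-form solution for one
    predictor.
    """
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two or more paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Hypothetical measurements, invented for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
slope, intercept = ols_fit(xs, ys)
print(f"y = {slope:.2f} * x + {intercept:.2f}")
```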

Feb 27, 2015 • 17min
NYC Speed Camera Analysis with Tim Schmeier
New York State approved the use of automated speed cameras within a specific range of schools. Tim Schmeier did an analysis of publicly available data related to these cameras as part of a project at the NYC Data Science Academy. Tim's work leverages several open data sets to ask the question: are the speed cameras succeeding in their intended purpose of increasing public safety near schools? What he found using open data may surprise you. You can read Tim's write-up, titled "Speed Cameras: Revenue or Public Safety?", on the NYC Data Science Academy blog. His original write-up, reproducible analysis, and figures are a great complement to this episode. For his benevolent recommendation, Tim suggests listeners visit Maddie's Fund, a data-driven charity devoted to helping achieve and sustain a no-kill pet nation. And for his self-serving recommendation, Tim Schmeier will very shortly be on the job market. If you, your employer, or someone you know is looking for data science talent, you can reach Tim at his gmail account, which is timothy.schmeier at gmail dot com.

Feb 20, 2015 • 14min
[MINI] k-means clustering
The podcast discusses the k-means clustering algorithm and its objective of grouping data points into clusters without guidance. It explores tracking animal movements and customer segmentation using k-means clustering. The concept of clusters and centroids is explained, along with classifying new data points. The episode covers accuracy, precision, and trade-offs in k-means clustering. Lastly, it explores clusters, head positioning, data visualization, and the application of k-means clustering in the workplace.
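
The core loop of k-means (often called Lloyd's algorithm) alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its assigned points. The sketch below is a minimal version under stated assumptions: the starting centroids are supplied by the caller rather than chosen at random, and the function names and sample points are invented for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def kmeans(points, initial_centroids, iterations=20):
    """Run Lloyd's algorithm from caller-supplied starting centroids."""
    centroids = [tuple(c) for c in initial_centroids]
    for _ in range(iterations):
        # Assignment step: group points by their nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for old, members in zip(centroids, clusters):
            if members:
                updated.append(tuple(sum(c) / len(members)
                                     for c in zip(*members)))
            else:
                updated.append(old)  # keep a centroid that lost all points
        if updated == centroids:
            break  # assignments have stabilized
        centroids = updated
    return centroids


# Hypothetical 2-D observations forming two loose groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = kmeans(points, initial_centroids=[(0, 0), (10, 10)])
print(centroids)
# Classifying a new point means finding its nearest centroid.
print(nearest((9, 9), centroids))
```

Results depend on the starting centroids, so practical implementations usually run several random initializations and keep the best clustering.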

Feb 13, 2015 • 39min
Shadow Profiles on Social Networks
Emre Sarigol joins me this week to discuss his paper "Online Privacy as a Collective Phenomenon". This paper studies data collected from social networks and how the sharing behaviors of individuals can unintentionally reveal private information about other people, including those who have not even joined the social network! For the specific test discussed, the researchers were able to accurately predict the sexual orientation of individuals, even when this information was withheld during the training of their algorithm. The research produces a surprisingly accurate predictor of this private piece of information, and was constructed only with publicly available data from myspace.com found on archive.org. As Emre points out, this is a small shadow of the potential information available to modern social networks. For example, users who install the Facebook app on their mobile phones are (perhaps unknowingly) sharing all their phone contacts. Should a social network like Facebook choose to do so, this information could be aggregated to assemble "shadow profiles" containing rich data on users who may not even have an account.