

Data Skeptic
Kyle Polich
The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the veracity of claims and efficacy of approaches.
Episodes

Feb 1, 2019 • 31min
word2vec
Word2vec is an unsupervised machine learning model which is able to capture semantic information from the text it is trained on. The model is based on neural networks. Several large organizations like Google and Facebook have trained word embeddings (the result of word2vec) on large corpora and shared them for others to use. One of the key algorithmic ideas in word2vec is the continuous bag of words (CBOW) model. In this episode, Kyle uses excerpts from the 1983 cinematic masterpiece War Games, and challenges Linh Da to guess a word Kyle leaves out of the transcript. This is similar to how word2vec is trained: a neural network learns to predict a hidden word based on the words that appear before and after the missing location.
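The fill-in-the-blank setup described above can be sketched in a few lines of Python. This is only an illustration of how CBOW-style training examples (surrounding words as input, hidden word as the label) might be built; the function name `cbow_pairs`, the `window` parameter, and the example sentence are assumptions made for this sketch and are not taken from the episode or from any word2vec implementation. No neural network is trained here.

```python
def cbow_pairs(tokens, window=2):
    """Build (context_words, target_word) examples in the CBOW style.

    For each position, the target is the "hidden" word and the context is
    up to `window` words before it and up to `window` words after it.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # words before the gap
        right = tokens[i + 1:i + 1 + window]     # words after the gap
        context = left + right
        if context:                              # skip one-word inputs
            pairs.append((context, target))
    return pairs


# Hypothetical example sentence, tokenized by whitespace.
sentence = "shall we play a game".split()
pairs = cbow_pairs(sentence, window=2)

for context, target in pairs:
    print(f"context={context} -> target={target!r}")
```

In an actual CBOW model, the context words would be mapped to vectors and averaged, and the network would be trained to assign high probability to the target word; the learned vectors are the word embeddings.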

Jan 25, 2019 • 51min
Authorship Attribution
In a recent paper, Leveraging Discourse Information Effectively for Authorship Attribution, authors Su Wang, Elisa Ferracane, and Raymond J. Mooney describe a deep learning methodology for predicting which of a collection of authors wrote a given document.

Jan 18, 2019 • 24min
Very Large Corpora and Zipf's Law
The earliest efforts to apply machine learning to natural language tended to convert every token (every word, more or less) into a unique feature. While techniques like stemming may have cut the number of unique tokens down, researchers always had to face a high-dimensional problem. The Naive Bayes algorithm was celebrated in NLP applications because of its ability to process high-dimensional data efficiently. Of course, other algorithms were applied to natural language tasks as well. While different algorithms had different strengths and weaknesses on different NLP problems, an early paper titled Scaling to Very Very Large Corpora for Natural Language Disambiguation popularized one somewhat surprising idea. For many NLP tasks, simply providing a large corpus of examples improved accuracy, and the paper also showed that, asymptotically, some algorithms yielded more improvement than others from working on very, very large corpora. Although not explicitly about NLP, the noteworthy paper The Unreasonable Effectiveness of Data emphasizes this point further while paying homage to the classic treatise The Unreasonable Effectiveness of Mathematics in the Natural Sciences. In this episode, Kyle shares a few thoughts along these lines with Linh Da. The discussion winds up with a brief introduction to Zipf's law. When applied to natural language, Zipf's law states that the frequency of any given word in a corpus (regardless of language) will be inversely proportional to its rank in the frequency table.
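As a rough illustration of Zipf's law, the sketch below ranks words by frequency and multiplies each word's rank by its count. If frequency is inversely proportional to rank, that product should stay roughly constant. The helper name `rank_frequency` and the toy corpus are assumptions for this example; the toy counts were chosen by hand to fit the law exactly, whereas real corpora follow it only approximately.

```python
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, word, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    ranked = counts.most_common()  # sorted by descending frequency
    return [
        (rank, word, freq, rank * freq)
        for rank, (word, freq) in enumerate(ranked, start=1)
    ]


# Hand-constructed toy corpus: frequencies 12, 6, 4, 3 for ranks 1 to 4.
toy_corpus = ["the"] * 12 + ["of"] * 6 + ["and"] * 4 + ["zipf"] * 3
table = rank_frequency(toy_corpus)

for rank, word, freq, product in table:
    print(f"rank={rank} word={word!r} freq={freq} rank*freq={product}")
```

To try this on real text, replace `toy_corpus` with the tokens of any large document; the rank-times-frequency column will vary, but under Zipf's law it should stay within the same order of magnitude for the most common words.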

Jan 11, 2019 • 35min
Semantic search at GitHub
GitHub is many things besides source control. It's a social network, even though not everyone realizes it. It's a vast repository of code. It's a ticketing and project management system. And of course, it has search as well. In this episode, Kyle interviews Hamel Husain about his research into semantic code search.

Jan 4, 2019 • 36min
Let's Talk About Natural Language Processing
This episode reboots our podcast with the theme of Natural Language Processing for the next few months. We begin with introductions of Yoshi and Linh Da and then get into a broad discussion about natural language processing: what it is, what some of the classic problems are, and just a bit on approaches. Finishing out the show is an interview with Lucy Park about her work on the KoNLPy library for Korean NLP in Python. If you want to share your NLP project, please join our Slack channel. We're eager to see what listeners are working on! http://konlpy.org/en/latest/

Dec 28, 2018 • 33min
Data Science Hiring Processes
Kyle shares a few thoughts on mistakes he has observed job applicants make, along with a few procedural insights that listeners at early stages in their careers might find valuable.

Dec 25, 2018 • 21min
Holiday Reading - Epicac
Epicac by Kurt Vonnegut.

Dec 21, 2018 • 29min
Drug Discovery with Machine Learning
In today's episode, Kyle chats with Alexander Zhebrak, CTO of Insilico Medicine, Inc. Insilico describes itself as artificial intelligence for drug discovery, biomarker development, and aging research. The conversation in this episode explores the ways in which machine learning, in particular deep learning, is contributing to the advancement of drug discovery. This happens not just through research but also through software development. Insilico works on data pipelines and tools like MOSES, a benchmarking platform to support research on machine learning for drug discovery. The MOSES platform provides a standardized benchmarking dataset, a set of open-sourced models with unified implementation, and metrics to evaluate and assess their performance.

Dec 14, 2018 • 20min
Sign Language Recognition
At the NeurIPS 2018 conference, Stradigi AI premiered the American Sign Language (ASL) Alphabet Game, a training game which helps players learn American Sign Language. This episode brings the first of many interviews conducted at NeurIPS 2018. In this episode, Kyle interviews Chief Data Scientist Carolina Bessega about the deep learning architecture used in this project. The Stradigi AI team also published a detailed blog post about how they built the system.

Dec 7, 2018 • 20min
Data Ethics
This week, Kyle interviews Scott Nestler on the topic of Data Ethics. Today, no ubiquitous, formal ethical protocol exists for data science, although some have been proposed. One example is the INFORMS Ethics Guidelines. Guidelines like this are rather informal compared to the commitments of some other professions, such as the Hippocratic Oath in medicine. Yet not every profession requires such a formal commitment. In this episode, Scott shares his perspective on a variety of ethical questions specific to data and analytics.