Data Skeptic

Latest episodes

Mar 10, 2017 • 15min

[MINI] The Perceptron

Today's episode gives an overview of the perceptron algorithm. This simple approach has a few defining features. It updates its weights after seeing each example, rather than in batches. It uses a step function as its activation function. It is only appropriate for linearly separable data, and it will converge to a solution when the data is linearly separable. Being a fairly simple algorithm, it can run very efficiently. Although we don't discuss it in this episode, multi-layer perceptron networks are what make this technique most attractive.
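The features described above can be sketched in a few lines of Python. This is a minimal illustration and not code from the episode: the function names, the learning rate, the epoch cap, and the toy AND dataset are all assumptions chosen for the example.

```python
def step(z):
    # Step activation: fire (1) only when the weighted sum is positive.
    return 1 if z > 0 else 0


def predict(weights, bias, x):
    return step(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_perceptron(samples, labels, lr=1.0, max_epochs=100):
    # lr and max_epochs are illustrative defaults, not values from the episode.
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error != 0:
                # Online update: adjust immediately after each misclassified example.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                mistakes += 1
        if mistakes == 0:
            # A full pass with no mistakes means a separating line was found.
            break
    return weights, bias


# Toy, linearly separable data: the logical AND of two inputs.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, labels)
predictions = [predict(weights, bias, x) for x in samples]
```

On data that is not linearly separable (XOR, for example), the loop would simply run until `max_epochs` without ever reaching a mistake-free pass.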
Mar 3, 2017 • 25min

The Data Refuge Project

DataRefuge is a public, collaborative, grassroots effort around the United States in which scientists, researchers, computer scientists, librarians, and other volunteers are working to download, save, and re-upload government data. The DataRefuge Project, which is led by the UPenn Program in Environmental Humanities and the Penn Libraries group at the University of Pennsylvania, aims to foster resilience in an era of anthropogenic global climate change and to raise awareness of how social and political events affect transparency.
Feb 24, 2017 • 16min

[MINI] Automated Feature Engineering

If a CEO wants to know the state of their business, they ask their highest ranking executives. These executives, in turn, should know the state of the business through reports from their subordinates. This structure is roughly analogous to a process observed in deep learning, where each layer of the business reports up different types of observations, KPIs, and reports to be interpreted by the next layer of the business. In deep learning, this process can be thought of as automated feature engineering. DNNs built to recognize objects in images may learn structures that behave like edge detectors in the first hidden layer. Subsequent layers learn to compose more abstract features from lower level outputs. This episode explores that analogy in the context of automated feature engineering. Linh Da and Kyle discuss a particular image in this episode. The image included below in the show notes is drawn from the work of Lee, Grosse, Ranganath, and Ng in their paper Convolutional Deep Belief Networks for Scalable Unsupervised Learning of Hierarchical Representations.
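To make "behaves like an edge detector" concrete, here is a small hand-written sketch. The kernel below is a fixed Prewitt-style vertical-edge filter chosen for illustration; a network would learn its own filters, and nothing here is taken from the paper or the episode. The function name and toy image are assumptions.

```python
# Hand-picked vertical-edge kernel (illustrative; a DNN would learn its filters).
VERTICAL_EDGE_KERNEL = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]


def filter_image(image, kernel):
    """Slide the kernel over the image (no padding) and sum the products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            total = 0
            for i in range(k_rows):
                for j in range(k_cols):
                    total += kernel[i][j] * image[r + i][c + j]
            row.append(total)
        output.append(row)
    return output


# Toy 3x5 image: dark (0) on the left, bright (1) on the right.
image = [
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
]
edges = filter_image(image, VERTICAL_EDGE_KERNEL)  # [[0, 3, 3]]

# A uniform image contains no edges, so every response is zero.
flat = [[1] * 5 for _ in range(3)]
no_edges = filter_image(flat, VERTICAL_EDGE_KERNEL)  # [[0, 0, 0]]
```

The filter responds only where brightness changes from left to right, which is the kind of low-level signal later layers can combine into more abstract features.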
Feb 17, 2017 • 31min

Big Data Tools and Trends

In this episode, I speak with Raghu Ramakrishnan, CTO for Data at Microsoft. We discuss services, tools, and developments in the big data sphere as well as the underlying needs that drove these innovations.
Feb 10, 2017 • 14min

[MINI] Primer on Deep Learning

In this episode, we give a high-level description of deep learning. Kyle presents a simple game (pictured below), which is really more of a puzzle, to try to give Linh Da the basic concept. Thanks to our sponsor for this week, the Data Science Association. Please check out their upcoming Dallas conference at dallasdatascience.eventbrite.com
Feb 3, 2017 • 40min

Data Provenance and Reproducibility with Pachyderm

Versioning isn't just for source code. Being able to track changes to data is critical for answering questions about data provenance, quality, and reproducibility. Daniel Whitenack joins me this week to talk about these concepts and share his work on Pachyderm. Pachyderm is an open source containerized data lake. During the show, Daniel mentioned the Gopher Data Science GitHub repo as a great resource for any data scientists interested in the Go language. Although we didn't mention it, Daniel also did an interesting analysis of the 2016 World Chess Championship that complements our recent episode on chess well. You can find that post here. Supplemental music is Lee Rosevere's Let's Start at the Beginning. Thanks to Periscope Data for sponsoring this episode. More about them at periscopedata.com/skeptics
Jan 27, 2017 • 21min

[MINI] Logistic Regression on Audio Data

Logistic regression is a popular classification algorithm. In this episode, we discuss how it can be used to determine if an audio clip represents one of two given speakers. It assumes that the log-odds of an output variable (isLinhda) is a linear combination of the available features, which are spectral bands in the discussion on this episode. Keep an eye on the dataskeptic.com blog this week as we post more details about this project. Thanks to our sponsor this week, the Data Science Association. Please check out their upcoming conference in Dallas on Saturday, February 18th, 2017 via the link below. dallasdatascience.eventbrite.com
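A rough sketch of the idea, assuming each clip has already been reduced to a short vector of spectral band energies. The two-band feature vectors, the learning rate, and the epoch count below are made up for illustration and are not the data or settings from the project. Only the output name isLinhda comes from the episode notes.

```python
import math


def sigmoid(z):
    # Squash a linear score into a probability between 0 and 1.
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(weights, bias, features):
    # Estimated probability that isLinhda is true for one clip.
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(score)


def train_logistic(samples, labels, lr=0.5, epochs=2000):
    # Plain stochastic gradient descent on the log loss (illustrative settings).
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = predict_proba(weights, bias, x) - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


# Hypothetical band energies (low band, high band) for four clips.
clips = [(0.9, 0.1), (0.8, 0.2), (0.2, 0.9), (0.1, 0.7)]
is_linhda = [1, 1, 0, 0]

weights, bias = train_logistic(clips, is_linhda)
probabilities = [predict_proba(weights, bias, clip) for clip in clips]
```

A clip would be attributed to Linh Da when its probability exceeds 0.5, and to the other speaker otherwise.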
Jan 20, 2017 • 34min

Studying Competition and Gender Through Chess

Prior work has shown that people's response to competition is in part predicted by their gender. Understanding why and when this occurs is important in areas such as labor market outcomes. A well-structured study is challenging due to numerous confounding factors. Peter Backus and his colleagues have identified competitive chess as an ideal arena to study the topic. Find out why and what conclusions they reached. Our discussion centers around Gender, Competition and Performance: Evidence from Real Tournaments from Backus, Cubel, Guid, Sanchez-Pages, and Mañas. A summary of their paper can also be found here.
Jan 13, 2017 • 16min

[MINI] Dropout

Deep learning models can be prone to overfitting a given problem. This is especially frustrating given how much time and computational resources are often required to converge. One technique for fighting overfitting is dropout. Dropout is the method of randomly selecting some neurons in one's network to set to zero during each iteration of learning. The core idea is that each particular input in a given layer is not always available and is therefore not a signal that can be relied on too heavily.
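As a minimal sketch of the mechanism, the function below zeroes a random subset of one layer's activations during training. The rescaling of the surviving values (often called inverted dropout) is an assumption added here so that the expected total stays the same. It is a common convention, not something stated in the episode, and the function name and parameters are illustrative.

```python
import random


def dropout(activations, drop_prob=0.5, training=True, rng=random):
    """Randomly zero activations during training; pass them through otherwise."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if not training or drop_prob == 0.0:
        # At prediction time every neuron is available.
        return list(activations)
    keep_prob = 1.0 - drop_prob
    # Assumption: scale survivors by 1 / keep_prob (inverted dropout).
    return [a / keep_prob if rng.random() < keep_prob else 0.0
            for a in activations]


layer_output = [1.0, 2.0, 3.0, 4.0]
rng = random.Random(0)  # seeded only so the sketch is repeatable

train_output = dropout(layer_output, drop_prob=0.5, rng=rng)
eval_output = dropout(layer_output, drop_prob=0.5, training=False)
```

Because a different subset is dropped on every call, downstream neurons cannot rely too heavily on any single input.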
Jan 6, 2017 • 49min

The Police Data and the Data Driven Justice Initiatives

In this episode, I speak with Clarence Wardell and Kelly Jin about their service on the White House's Police Data Initiative and Data Driven Justice Initiative, respectively. The Police Data Initiative was organized to use open data to increase transparency and community trust as well as to help police agencies use data for internal accountability. The PDI emerged from recommendations made by the Task Force on 21st Century Policing. The Data Driven Justice Initiative was organized to help city, county, and state governments use data-driven strategies to direct low-level offenders with mental illness to the right services rather than into the criminal justice system.
