Data Skeptic

Kyle Polich

The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to eva...

AI Decision-Making

42:59 | Feb 16th, 2018

Making a decision is a complex task. Today's guest Dongho Kim discusses how he and his team at Prowler have been building a platform that will be accessible by way of APIs and a set of pre-made scripts for autonomous decision making based on probabili...

Artificial Intelligence, a Podcast Approach

33:17 | Dec 29th, 2017

This episode kicks off the next theme on Data Skeptic: artificial intelligence.  Kyle discusses what's to come for the show in 2018, why this topic is relevant, and how we intend to cover it.

P vs NP

38:48 | Nov 17th, 2017

In this week's episode, host Kyle Polich interviews author Lance Fortnow about whether P will ever be equal to NP and solve all of life’s problems. Fortnow begins the discussion with the example question: Are there 100 people on Facebook who are all ...

[MINI] Big Oh Analysis

18:44 | Oct 13th, 2017

How long an algorithm takes to run depends on many factors including implementation details and hardware.  However, the formal analysis of algorithms focuses on how they will perform in the worst case as the input size grows.  We refer to an algorith...

[MINI] Bayesian Belief Networks

17:03 | Aug 4th, 2017

A Bayesian Belief Network is an acyclic directed graph composed of nodes that represent random variables and edges that imply a conditional dependence between them. It's an intuitive way of encoding your statistical knowledge about a system and is ef...

Cross-lingual Short-text Matching

24:43 | Apr 5th

Modern messaging technology has facilitated a trend towards highly compact, short messages sent by users who can presume a great amount of context held between the communicating parties.  The rules of grammar may be discarded and often visible errors...


23:49 | Mar 29th

ELMo (Embeddings from Language Models) introduced the idea of deep contextualized word representations. It extends previous ideas like word2vec and GloVe. The ELMo model is a neural network able to map natural language into a vector space. This vecto...


42:23 | Mar 23rd

Bilingual evaluation understudy (or BLEU) is a metric for evaluating the quality of machine translation using human translation as examples of acceptable quality results. This metric has become a widely used standard in the research literature. But i...

Simultaneous Translation at Baidu

24:10 | Mar 15th

While at NeurIPS 2018, Kyle chatted with Liang Huang about his work with Baidu research on simultaneous translation, which was demoed at the conference.

Human vs Machine Transcription

32:43 | Mar 8th

Machine transcription (the process of translating audio recordings of language to text) has come a long way in recent years. But how do the errors made during machine transcription compare to the errors made by a human transcriber? Find out in this e...


21:41 | Mar 1st

A sequence to sequence (or seq2seq) model is a neural architecture used for translation (and other tasks) which consists of an encoder and a decoder. The encoder/decoder architecture has obvious promise for machine translation, and has been successfull...

Text Mining in R

20:28 | Feb 22nd

Kyle interviews Julia Silge about her path into data science, her book Text Mining with R, and some of the ways in which she's used natural language processing in projects both personal and professional. Related Links https://stack-survey-2018.glitc...

Recurrent Relational Networks

19:13 | Feb 15th

One of the most challenging NLP tasks is natural language understanding and reasoning. How can we construct algorithms that are able to achieve human level understanding of text and be able to answer general questions about it? This is truly an open ...

Text World and Word Embedding Lower Bounds

39:07 | Feb 8th

In the first half of this episode, Kyle speaks with Marc-Alexandre Côté and Wendy Tay about Text World.  Text World is an engine that simulates text adventure games.  Developers are encouraged to try out their reinforcement learning skills building a...


31:27 | Feb 1st

Word2vec is an unsupervised machine learning model which is able to capture semantic information from the text it is trained on. The model is based on neural networks. Several large organizations like Google and Facebook have trained word embeddings ...

Authorship Attribution

50:37 | Jan 25th

In a recent paper, Leveraging Discourse Information Effectively for Authorship Attribution, authors Su Wang, Elisa Ferracane, and Raymond J. Mooney describe a deep learning methodology for predicting which of a collection of authors was the author of a ...

Very Large Corpora and Zipf's Law

24:11 | Jan 18th

The earliest efforts to apply machine learning to natural language tended to convert every token (every word, more or less) into a unique feature. While techniques like stemming may have cut the number of unique tokens down, researchers always had to...

Semantic search at Github

34:57 | Jan 11th

Github is many things besides source control. It's a social network, even though not everyone realizes it. It's a vast repository of code. It's a ticketing and project management system. And of course, it has search as well. In this episode, Kyle int...

Let's Talk About Natural Language Processing

36:13 | Jan 4th

This episode reboots our podcast with the theme of Natural Language Processing for the next few months. We begin with introductions of Yoshi and Linh Da and then get into a broad discussion about natural language processing: what it is, what some of ...

Data Science Hiring Processes

33:05 | Dec 28th, 2018

Kyle shares a few thoughts on mistakes he has observed job applicants make, along with a few procedural insights that listeners at early stages in their careers might find valuable.

Drug Discovery with Machine Learning

28:59 | Dec 21st, 2018

In today's episode, Kyle chats with Alexander Zhebrak, CTO of Insilico Medicine, Inc. Insilico self-describes as artificial intelligence for drug discovery, biomarker development, and aging research. The conversation in this episode explores the ways...

Sign Language Recognition

19:46 | Dec 14th, 2018

At the NeurIPS 2018 conference, Stradigi AI premiered a training game which helps players learn American Sign Language. This episode brings the first of many interviews conducted at NeurIPS 2018. In this episode, Kyle interviews Chief Data Scientist ...

Data Ethics

19:51 | Dec 7th, 2018

This week, Kyle interviews Scott Nestler on the topic of Data Ethics. Today, no ubiquitous, formal ethical protocol exists for data science, although some have been proposed. One example is the INFORMS Ethics Guidelines. Guidelines like this are rath...

Escaping the Rabbit Hole

33:49 | Nov 30th, 2018

Kyle interviews Mick West, author of Escaping the Rabbit Hole: How to Debunk Conspiracy Theories Using Facts, Logic, and Respect about the nature of conspiracy theories, the people that believe them, and how to help people escape the belief in false ...

Theorem Provers

18:59 | Nov 23rd, 2018

Fake news attempts to lead readers/listeners/viewers to conclusions that are not descriptions of reality.  They do this most often by presenting false premises, but sometimes by presenting flawed logic. An argument is only sound and valid if the conc...

Automated Fact Checking

31:48 | Nov 16th, 2018

Fake news can be responded to with fact-checking. However, it's easier to create fake news than to fact-check it. Full Fact is the UK's independent fact-checking organization. In this episode, Kyle interviews Mevan Babakar, head of automated fact-ch...

Single Source of Truth

29:30 | Nov 9th, 2018

In mathematics, truth is universal.  In data, truth lies in the where clause of the query. As large organizations have grown to rely on their data more significantly for decision making, a common problem is not being able to agree on what the data is...

Detecting Fast Radio Bursts with Deep Learning

44:51 | Nov 2nd, 2018

Fast radio bursts are an astrophysical phenomenon first observed in 2007. While many observations have been made, science has yet to explain the mechanism for these events. This has led some to ask: could it be a form of extra-terrestrial communicati...

Being Bayesian

24:38 | Oct 26th, 2018

This episode explores the root concept of what it is to be Bayesian: describing knowledge of a system probabilistically, having an appropriate prior probability, knowing how to weigh new evidence, and following Bayes's rule to compute the revised distri...

Modeling Fake News

33:12 | Oct 19th, 2018

This is our interview with Dorje Brody about his recent paper with David Meier, How to model fake news. This paper uses the tools of communication theory and a sub-topic called filtering theory to describe the mathematical basis for an information ch...

The Louvain Method for Community Detection

26:47 | Oct 12th, 2018

Without getting into definitions, we have an intuitive sense of what a "community" is. The Louvain Method for Community Detection is one of the best known mathematical techniques designed to detect communities. This method requires typical graph data...

Cultural Cognition of Scientific Consensus

31:48 | Oct 5th, 2018

In this episode, our guest Dan Kahan discusses his research into how people consume and interpret science news. In an era of fake news, motivated reasoning, and alternative facts, important questions need to be asked about how people understand new in...

False Discovery Rates

25:46 | Sep 28th, 2018

A false discovery rate (FDR) is a methodology that can be useful when struggling with the problem of multiple comparisons. In any experiment, if the experimenter checks more than one dependent variable, then they are making multiple comparisons. Natu...

Deep Fakes

30:23 | Sep 21st, 2018

Digital videos can be described as sequences of still images and associated audio. Audio is easy to fake. What about video? A video can easily be broken down into a sequence of still images replayed rapidly in sequence. In this context, videos are si...

Fake News Midterm

19:19 | Sep 14th, 2018

In this episode, Kyle reviews what we've learned so far in our series on Fake News and talks briefly about where we're going next.

Quality Score

18:55 | Sep 7th, 2018

Two weeks ago we discussed click through rates or CTRs and their usefulness and limits as a metric. Today, we discuss a related metric known as quality score. While that phrase has probably been used to mean dozens of different things in different co...

The Knowledge Illusion

40:01 | Aug 31st, 2018

Kyle interviews Steven Sloman, Professor in the school of Cognitive, Linguistic, and Psychological Sciences at Brown University. Steven is co-author of The Knowledge Illusion: Why We Never Think Alone and Causal Models: How People Think about the Wor...

Click Through Rates

31:45 | Aug 24th, 2018

A Click Through Rate (CTR) is the proportion of clicks to impressions of some item of content shared online. This terminology is most commonly used in digital advertising but applies just as well to content websites might choose to feature on their h...

Algorithmic Detection of Fake News

46:26 | Aug 17th, 2018

The scale and frequency with which information can be distributed on social media makes the problem of fake news a rapidly metastasizing issue. To do any content filtering or labeling demands an algorithmic solution. In today's episode, Kyle intervie...

Ant Intelligence

28:17 | Aug 10th, 2018

If you prepared a list of creatures regarded as highly intelligent, it's unlikely ants would make the cut. This is expected, as on an individual level, ants do not generally display behavior that most humans would regard as intelligence. In fact, it ...

Human Detection of Fake News

28:27 | Aug 3rd, 2018

With publications such as "Prior exposure increases perceived accuracy of fake news", "Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning", and "The science of fake news", Gordo...

Spam Filtering with Naive Bayes

19:45 | Jul 27th, 2018

Today's spam filters are advanced data driven tools. They rely on a variety of techniques to effectively and often seamlessly filter out junk email from good email. Whitelists, blacklists, traffic analysis, network analysis, and a variety of other to...

The Spread of Fake News

45:18 | Jul 20th, 2018

How does fake news get spread online? It's not just a matter of manipulating search algorithms. The social platforms for sharing play a major role in the distribution of fake news. But how significant of an impact can there be? How significantly can b...

Fake News

38:19 | Jul 13th, 2018

This episode kicks off our new theme of "Fake News" with guests Robert Sheaffer and Brad Schwartz. Fake news is a new label for an old idea. For our purposes, we will define fake news as information created to deliberately mislead while masquerading as ...

Dev Ops for Data Science

38:20 | Jul 11th, 2018

We revisit the 2018 Microsoft Build in this episode, focusing on the latest ideas in DevOps. Kyle interviews Cloud Developer Advocates Damien Brady, Paige Bailey, and Donovan Brown to talk about DevOps and data science and databases. For a data scien...

First Order Logic

16:51 | Jul 6th, 2018

Logic is a fundamental of mathematical systems. Its roots are the values true and false, and its power is in what its rules allow you to prove. Propositional logic provides its user variables. This episode gets into First Order Logic, an extension...

Blind Spots in Reinforcement Learning

27:35 | Jun 29th, 2018

An intelligent agent trained in a simulated environment may be prone to making mistakes in the real world due to discrepancies between the training and real-world conditions. The areas where an agent makes mistakes are hard to find, known as "blind s...

Defending Against Adversarial Attacks

31:29 | Jun 22nd, 2018

In this week’s episode, our host Kyle interviews Gokula Krishnan from ETH Zurich, about his recent contributions to defenses against adversarial attacks. The discussion centers around his latest paper, titled “Defending Against Adversarial Attacks by...

Transfer Learning

18:04 | Jun 15th, 2018

On a long car ride, Linhda and Kyle record a short episode. This discussion is about transfer learning, a technique used in machine learning to leverage training from one domain to have a head start learning in another domain. Transfer learning has ...

Medical Imaging Training Techniques

25:21 | Jun 8th, 2018

Medical imaging is a highly effective tool used by clinicians to diagnose a wide array of diseases and injuries. However, it often requires exceptionally trained specialists such as radiologists to interpret accurately. In this episode of Data Skepti...

Kalman Filters

21:32 | Jun 1st, 2018

Thanks to our sponsor Galvanize. A Kalman Filter is a technique for taking a sequence of observations about an object or variable and determining the most likely current state of that object. In this episode, we discuss it in the context of tracking o...

AI in Industry

43:03 | May 25th, 2018

There's so much to discuss on the AI side, it's hard to know where to begin. Luckily, Steve Guggenheimer, Microsoft’s corporate vice president of AI Business, and Carlos Pessoa, a software engineering manager for the company’s Cloud AI Platform, tal...

AI in Games

25:58 | May 18th, 2018

Today's interview is with the authors of the textbook Artificial Intelligence and Games.

Game Theory

24:11 | May 11th, 2018

Thanks to our sponsor The Great Courses. This week's episode is a short primer on game theory. For tickets to the free Data Skeptic meetup in Chicago on Tuesday, May 15 at the Mendoza College of Business (224 South Michigan Avenue, Suite 350), click ...

The Experimental Design of Paranormal Claims

27:32 | May 4th, 2018

In this episode of Data Skeptic, Kyle chats with Jerry Schwarz from the Independent Investigations Group (IIG)'s SF Bay Area chapter about testing claims of the paranormal. The IIG is a volunteer-based organization dedicated to investigating paranorm...

Winograd Schema Challenge

36:57 | Apr 27th, 2018

Our guest this week, Hector Levesque, joins us to discuss an alternative way to measure a machine’s intelligence, called the Winograd Schema Challenge. The challenge was proposed as a possible alternative to the Turing test during the 2011 AAAI Spring S...

The Imitation Game

1:00:58 | Apr 20th, 2018

This week on Data Skeptic, we begin with a skit to introduce the topic of this show: The Imitation Game. We open with a scene in the distant future. The year is 2027, and a company called Shamony is announcing their new product, Ada, the most advance...

Eugene Goostman

17:15 | Apr 13th, 2018

In this episode, Kyle shares his perspective on the chatbot Eugene Goostman which (some claim) "passed" the Turing Test. As a second topic Kyle also does an intro of the Winograd Schema Challenge.

The Theory of Formal Languages

23:44 | Apr 6th, 2018

In this episode, Kyle and Linhda discuss the theory of formal languages. Any language can (theoretically) be a formal language. The requirement is that the language can be rigorously described as a set of strings which are considered part of the lang...

The Loebner Prize

33:21 | Mar 30th, 2018

The Loebner Prize is a competition in the spirit of the Turing Test.  Participants are welcome to submit conversational agent software to be judged by a panel of humans.  This episode includes interviews with Charlie Maloney, a judge in the Loebner P...


27:05 | Mar 23rd, 2018

In this episode, Kyle chats with Vince from and Heather Shapiro who works on the Microsoft Bot Framework. We solicit their advice on building a good chatbot both creatively and technically. Our sponsor today is Warby Parker.

The Master Algorithm

46:34 | Mar 16th, 2018

In this week’s episode, Kyle Polich interviews Pedro Domingos about his book, The Master Algorithm: How the quest for the ultimate learning machine will remake our world. In the book, Domingos describes what machine learning is doing for humanity, ho...

The No Free Lunch Theorems

27:25 | Mar 9th, 2018

What's the best machine learning algorithm to use? I hear that XGBoost wins most of the Kaggle competitions that aren't won with deep learning. Should I just use XGBoost all the time? That might work out most of the time in practice, but a proof exis...

ML at Sloan Kettering Cancer Center

38:34 | Mar 2nd, 2018

For a long time, physicians have recognized that the tools they have aren't powerful enough to treat complex diseases, like cancer. In addition to data science and models, clinicians also needed actual products — tools that physicians and researchers...

Optimal Decision Making with POMDPs

18:40 | Feb 23rd, 2018

In a previous episode, we discussed Markov Decision Processes or MDPs, a framework for decision making and planning. This episode explores the generalization Partially Observable MDPs (POMDPs) which are an incredibly general framework that describes ...

[MINI] Reinforcement Learning

23:03 | Feb 9th, 2018

In many real world situations, a person/agent doesn't necessarily know their own objectives or the mechanics of the world they're interacting with. However, if the agent receives rewards which are correlated with both their actions and the state ...

Evolutionary Computation

24:44 | Feb 2nd, 2018

In this week’s episode, Kyle is joined by Risto Miikkulainen, a professor of computer science and neuroscience at the University of Texas at Austin. They talk about evolutionary computation, its applications in deep learning, and how it’s inspired by...

[MINI] Markov Decision Processes

20:24 | Jan 26th, 2018

Formally, an MDP is defined as the tuple containing states, actions, the transition function, and the reward function. This podcast examines each of these and presents them in the context of simple examples.  Despite MDPs suffering from the curse of...

Neuroscience Frontiers

29:06 | Jan 19th, 2018

Last week on Data Skeptic, we visited the Laboratory of Neuroimaging, or LONI, at USC and learned about their data-driven platform that enables scientists from all over the world to share, transform, store, manage and analyze their data to understand...

Neuroimaging and Big Data

26:37 | Jan 12th, 2018

Last year, Kyle had a chance to visit the Laboratory of Neuroimaging, or LONI, at USC, and learn about how some researchers are using data science to study the function of the brain. We’re going to be covering some of their work in two episodes on Da...

The Agent Model of Artificial Intelligence

17:21 | Jan 5th, 2018

In artificial intelligence, the term 'agent' is used to mean an autonomous, thinking agent with the ability to interact with their environment. An agent could be a person or a piece of software. In either case, we can describe aspects of the agent in...

Holiday reading 2017

12:38 | Dec 22nd, 2017

We break format from our regular programming today and bring you an excerpt from Max Tegmark's book "Life 3.0".  The first chapter is a short story titled "The Tale of the Omega Team".  Audio excerpted courtesy of Penguin Random House Audio from LIFE...

Complexity and Cryptography

35:53 | Dec 15th, 2017

This week, our host Kyle Polich is joined by guest Tim Henderson from Google to talk about the computational complexity foundations of modern cryptography and the complexity issues that underlie the field. A key question that arises during the discus...

Mercedes Benz Machine Learning Research

27:05 | Dec 14th, 2017

This episode features an interview with Rigel Smiroldo recorded at NIPS 2017 in Long Beach, California.  We discuss data privacy, machine learning use cases, model deployment, and end-to-end machine learning.

[MINI] Parallel Algorithms

20:37 | Dec 8th, 2017

When computers became commodity hardware and storage became incredibly cheap, we entered the era of so-called "big" data. Most definitions of big data will include something about not being able to process all the data on a single machine. Distributed ...

Quantum Computing

47:49 | Dec 1st, 2017

In this week's episode, Scott Aaronson, a professor at the University of Texas at Austin, explains what a quantum computer is, various possible applications, the types of problems they are good at solving and much more. Kyle and Scott have a lively d...

Azure Databricks

28:27 | Nov 28th, 2017

I sat down with Ali Ghodsi, CEO and founder of Databricks, and John Chirapurath, GM for Data Platform Marketing at Microsoft, to discuss the recent announcement of Azure Databricks. When I heard about the announcement, my first thoughts were two-fold.  ...

[MINI] Exponential Time Algorithms

15:55 | Nov 24th, 2017

In this episode we discuss the complexity class of EXP-Time which contains algorithms which require $O(2^{p(n)})$ time to run.  In other words, the worst case runtime is exponential in some polynomial of the input size.  Problems in this class are ev...

[MINI] Sudoku \in NP

18:29 | Nov 10th, 2017

Algorithms with similar runtimes are said to be in the same complexity class. That runtime is measured in how many steps an algorithm takes relative to the input size. The class P contains all algorithms which run in polynomial time (basically, a...

The Computational Complexity of Machine Learning

47:31 | Nov 3rd, 2017

In this episode, Professor Michael Kearns from the University of Pennsylvania joins host Kyle Polich to talk about the computational complexity of machine learning, complexity in game theory, and algorithmic fairness. Michael's doctoral thesis gave a...

[MINI] Turing Machines

13:54 | Oct 27th, 2017

TMs are a model of computation at the heart of algorithmic analysis.  A Turing Machine has two components: an infinitely long piece of tape (memory) with re-writable squares, and a read/write head which is programmed to change its state as it proces...

The Complexity of Learning Neural Networks

38:51 | Oct 20th, 2017

Over the past several years, we have seen many success stories in machine learning brought about by deep learning techniques. While the practical success of deep learning has been phenomenal, the formal guarantees have been lacking. Our current theor...

Data science tools and other announcements from Ignite

31:40 | Oct 6th, 2017

In this episode, Microsoft's Corporate Vice President for Cloud Artificial Intelligence, Joseph Sirosh, joins host Kyle Polich to share some of Microsoft's latest and most exciting innovations in AI development platforms. Last month, Microsoft la...

Generative AI for Content Creation

34:33 | Sep 29th, 2017

Last year, the film development and production company End Cue produced a short film, called Sunspring, that was entirely written by an artificial intelligence using neural networks. More specifically, it was authored by a recurrent neural network (R...

[MINI] One Shot Learning

17:39 | Sep 22nd, 2017

One Shot Learning is the class of machine learning procedures that focuses on learning something from a small number of examples.  This is in contrast to "traditional" machine learning which typically requires a very large training set to build a reason...

Recommender Systems Live from FARCON 2017

46:09 | Sep 15th, 2017

Recommender systems play an important role in providing personalized content to online users. Yet, typical data mining techniques are not well suited for the unique challenges that recommender systems face. In this episode, host Kyle Polich joins Dr....

[MINI] Long Short Term Memory

15:29 | Sep 8th, 2017

A Long Short Term Memory (LSTM) is a neural unit, often used in a Recurrent Neural Network (RNN), which attempts to provide the network the capacity to store information for longer periods of time. An LSTM unit remembers values for either long or short t...

Zillow Zestimate

37:11 | Sep 1st, 2017

Zillow is a leading real estate information and home-related marketplace. We interviewed Andrew Martin, a data science Research Manager at Zillow, to learn more about how Zillow uses data science and big data to make real estate predictions.

Cardiologist Level Arrhythmia Detection with CNNs

32:05 | Aug 25th, 2017

Our guest Pranav Rajpurkar and his coauthors recently published Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks, a paper in which they demonstrate the use of Convolutional Neural Networks which outperform board certified c...

[MINI] Recurrent Neural Networks

17:06 | Aug 18th, 2017

RNNs are a class of deep learning models designed to capture sequential behavior.  An RNN trains a set of weights which depend not just on new input but also on the previous state of the neural network.  This directed cycle allows the training phase ...

Project Common Voice

31:14 | Aug 11th, 2017

Thanks to our sponsor Springboard. In this week's episode, guest Andre Natal from Mozilla joins our host, Kyle Polich, to discuss a couple exciting new developments in open source speech recognition systems, which include Project Common Voice. In Jun...


26:59 | Jul 28th, 2017

In this episode, Tony Beltramelli of UIzard Technologies joins our host, Kyle Polich, to talk about the ideas behind his latest app that can transform graphic design into functioning code, as well as his previous work on spying with wearables.

[MINI] Conditional Independence

14:43 | Jul 21st, 2017

In statistics, two random variables might depend on one another (for example, interest rates and new home purchases). We call this conditional dependence. An important related concept exists called conditional independence. This phrase describes situ...

Estimating Sheep Pain with Facial Recognition

27:05 | Jul 14th, 2017

Animals can't tell us when they're experiencing pain, so we have to rely on other cues to help treat their discomfort. But it is often difficult to tell how much an animal is suffering. The sheep, for instance, is the most inscrutable of animals. How...


33:33 | Jul 7th, 2017

This episode collects interviews from my recent trip to Microsoft Build where I had the opportunity to speak with Dharma Shukla and Syam Nair about the recently announced CosmosDB. CosmosDB is a globally consistent, distributed datastore that support...

[MINI] The Vanishing Gradient

15:16 | Jun 30th, 2017

This episode discusses the vanishing gradient - a problem that arises when training deep neural networks in which nearly all the gradients are very close to zero by the time back-propagation has reached the first hidden layer. This makes learning vir...

Doctor AI

41:50 | Jun 23rd, 2017

When faced with medical issues, would you want to be seen by a human or a machine? In this episode, guest Edward Choi, co-author of the study titled Doctor AI: Predicting Clinical Events via Recurrent Neural Network shares his thoughts. Edward present...

[MINI] Activation Functions

14:11 | Jun 16th, 2017

In a neural network, the output value of a neuron is almost always transformed in some way using a function. A trivial choice would be a linear transformation which can only scale the data. However, other transformations, like a step function allow f...

MS Build 2017

27:37 | Jun 9th, 2017

This episode recaps the Microsoft Build Conference.  Kyle recently attended and shares some thoughts on cloud, databases, cognitive services, and artificial intelligence.  The episode includes interviews with Rohan Kumar and David Carmona.

[MINI] Max-pooling

12:33 | Jun 2nd, 2017

Max-pooling is a procedure in a neural network which has several benefits. It performs dimensionality reduction by taking a collection of neurons and reducing them to a single value for future layers to receive as input. It can also prevent overfitti...

Unsupervised Depth Perception

23:43 | May 26th, 2017

This episode is an interview with Tinghui Zhou.  In the recent paper "Unsupervised Learning of Depth and Ego-motion from Video", Tinghui and collaborators propose a deep learning architecture which is able to learn depth and pose information from unl...

[MINI] Convolutional Neural Networks

14:54 | May 19th, 2017

CNNs are characterized by their use of a group of neurons typically referred to as a filter or kernel.  In image recognition, this kernel is repeated over the entire image.  In this way, CNNs may achieve the property of translational invariance - onc...

Multi-Agent Diverse Generative Adversarial Networks

29:19 | May 12th, 2017

Despite the success of GANs in imaging, one of their major drawbacks is the problem of 'mode collapse,' where the generator learns to produce samples with extremely low variety. To address this issue, today's guests Arnab Ghosh and Viveka Kulharia prop...

[MINI] Generative Adversarial Networks

09:51 | May 5th, 2017

GANs are an unsupervised learning method involving two neural networks iteratively competing. The discriminator is a typical learning system. It attempts to develop the ability to recognize members of a certain class, such as all photos which have bi...

Opinion Polls for Presidential Elections

52:59 | Apr 28th, 2017

Recently, we've seen opinion polls come under some skepticism.  But is that skepticism truly justified?  The recent Brexit referendum and US 2016 Presidential Election are examples where some claim the polls "got it wrong".  This episode explores th...Show More


26:17 | Apr 21st, 2017

No reliable, complete database cataloging home sales data at a transaction level is available for the average person to access. For data scientists interested in studying this data, our hands are completely tied. Opportunities like testing sociologica...Show More


11:03 | Apr 14th, 2017

There's more than one type of computer processor. The central processing unit (CPU) is typically what one means when they say "processor". GPUs were introduced to be highly optimized for doing floating point computations in parallel. These types of o...Show More

[MINI] Backpropagation

15:13 | Apr 7th, 2017

Backpropagation is a common algorithm for training a neural network.  It works by computing the gradient of the overall error with respect to each weight, and using stochastic gradient descent to iteratively fine-tune the weights of the network.  In ...Show More

Data Science at Patreon

32:23 | Mar 31st, 2017

In this week's episode of Data Skeptic, host Kyle Polich talks with guest Maura Church, Patreon's data science manager. Patreon is a fast-growing crowdfunding platform that allows artists and creators of all kinds to build their own subscription content...Show More

[MINI] Feed Forward Neural Networks

15:58 | Mar 24th, 2017

In a feed forward neural network, neurons cannot form a cycle. In this episode, we explore how such a network would be able to represent three common logical operators: OR, AND, and XOR. The XOR operation is the interesti...Show More
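
One way to see how a feed forward network can represent XOR is to wire it by hand. The sketch below is not taken from the episode: the weights, thresholds, and function names are assumptions chosen for illustration. It uses one hidden neuron for OR, one for AND, and an output neuron that fires when OR is on but AND is off.

```python
def step(z):
    """Step activation: fire (1) only when the weighted input is positive."""
    return 1 if z > 0 else 0


def xor_network(x1, x2):
    # Hidden layer: two neurons with hand-picked weights and biases.
    h_or = step(x1 + x2 - 0.5)   # behaves like OR
    h_and = step(x1 + x2 - 1.5)  # behaves like AND
    # Output layer: OR but not AND, which is XOR.
    return step(h_or - h_and - 0.5)


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

No single neuron with a step activation can produce this truth table, which is why the hidden layer is needed.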

Reinventing Sponsored Search Auctions

41:31 | Mar 17th, 2017

In this Data Skeptic episode, Kyle is joined by guest Ruggiero Cavallo to discuss his latest efforts to mitigate the problems presented in this new world of online advertising. Working with his collaborators, Ruggiero reconsiders the search ad alloca...Show More

[MINI] The Perceptron

14:46 | Mar 10th, 2017

Today's episode overviews the perceptron algorithm. This rather simple approach is characterized by a few particular features. It updates its weights after seeing every example, rather than as a batch. It uses a step function as an activation functio...Show More
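
The features listed above (a weight update after every example, and a step function as the activation) can be sketched in a few lines. This is a minimal illustration rather than the episode's own code; the function names, the learning rate, the epoch count, and the AND training set are all assumptions.

```python
def step(z):
    """Step activation function."""
    return 1 if z > 0 else 0


def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(examples, epochs=20, learning_rate=1):
    """Online perceptron training: weights change after each single example."""
    weights = [0] * len(examples[0][0])
    bias = 0
    for _ in range(epochs):
        for inputs, target in examples:
            error = target - predict(weights, bias, inputs)
            # When the prediction is correct, error is 0 and nothing changes.
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Hypothetical training set: the logical AND of two binary inputs.
and_examples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_examples)
print(weights, bias)
```

Because AND is linearly separable, the updates stop changing the weights after a handful of passes over the data.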

The Data Refuge Project

24:35 | Mar 3rd, 2017

DataRefuge is a public collaborative, grassroots effort around the United States in which scientists, researchers, computer scientists, librarians and other volunteers are working to download, save, and re-upload government data. The DataRefuge Proje...Show More

[MINI] Automated Feature Engineering

16:14 | Feb 24th, 2017

If a CEO wants to know the state of their business, they ask their highest ranking executives. These executives, in turn, should know the state of the business through reports from their subordinates. This structure is roughly analogous to a process ...Show More

Big Data Tools and Trends

30:45 | Feb 17th, 2017

In this episode, I speak with Raghu Ramakrishnan, CTO for Data at Microsoft.  We discuss services, tools, and developments in the big data sphere as well as the underlying needs that drove these innovations.

[MINI] Primer on Deep Learning

14:28 | Feb 10th, 2017

In this episode, we talk about a high-level description of deep learning.  Kyle presents a simple game (pictured below), which is more of a puzzle really, to try and give  Linh Da the basic concept.     Thanks to our sponsor for this week, the Dat...Show More

Data Provenance and Reproducibility with Pachyderm

40:11 | Feb 3rd, 2017

Versioning isn't just for source code. Being able to track changes to data is critical for answering questions about data provenance, quality, and reproducibility. Daniel Whitenack joins me this week to talk about these concepts and share his work on...Show More

[MINI] Logistic Regression on Audio Data

20:48 | Jan 27th, 2017

Logistic Regression is a popular classification algorithm. In this episode we discuss how it can be used to determine if an audio clip represents one of two given speakers. It assumes an output variable (isLinhda) is a linear combination of available...Show More

Studying Competition and Gender Through Chess

34:27 | Jan 20th, 2017

Prior work has shown that people's response to competition is in part predicted by their gender. Understanding why and when this occurs is important in areas such as labor market outcomes. A well structured study is challenging due to numerous confou...Show More

[MINI] Dropout

15:55 | Jan 13th, 2017

Deep learning can be prone to overfit a given problem. This is especially frustrating given how much time and computational resources are often required to converge. One technique for fighting overfitting is to use dropout. Dropout is the method of r...Show More

The Police Data and the Data Driven Justice Initiatives

49:17 | Jan 6th, 2017

In this episode I speak with Clarence Wardell and Kelly Jin about their mutual service as part of the White House's Police Data Initiative and Data Driven Justice Initiative respectively. The Police Data Initiative was organized to use open data to i...Show More

The Library Problem

35:23 | Dec 30th, 2016

We close out 2016 with a discussion of a basic interview question which might get asked when applying for a data science job. Specifically, how a library might build a model to predict if a book will be returned late or not.

2016 Holiday Special

39:33 | Dec 23rd, 2016

Today's episode is a reading of Isaac Asimov's Franchise.  As mentioned on the show, this is just a work of fiction to be enjoyed and not in any way some obfuscated political statement.  Enjoy, and happy holidays!

[MINI] Entropy

16:36 | Dec 16th, 2016

Classically, entropy is a measure of disorder in a system. From a statistical perspective, it is more useful to say it's a measure of the unpredictability of the system. In this episode we discuss how information reduces the entropy in deciding wheth...Show More
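
To make "unpredictability" concrete, here is a small sketch of Shannon entropy measured in bits. It is an illustration under assumptions (the function name `entropy` and the example distributions are not from the episode).

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(entropy([0.5, 0.5]))   # a fair coin: 1 bit of uncertainty
print(entropy([0.9, 0.1]))   # a biased coin is more predictable, so lower entropy
print(entropy([1.0]))        # a certain outcome carries no uncertainty
```

Information that shifts the probabilities toward one outcome lowers the entropy, which is the sense in which information reduces it.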

MS Connect Conference

42:23 | Dec 9th, 2016

Cloud services are now ubiquitous in data science and more broadly in technology as well. This week, I speak to Mark Souza, Tobias Ternström, and Corey Sanders about various aspects of data at scale. We discuss the embedding of R into SQLServer, SQLS...Show More

Causal Impact

34:13 | Dec 2nd, 2016

Today's episode is all about Causal Impact, a technique for estimating the impact of a particular event on a time series. We talk to William Martin about his research into the impact releases have on app and we also chat with Karen Blakemore about a ...Show More

[MINI] The Bootstrap

10:37 | Nov 25th, 2016

The Bootstrap is a method of resampling a dataset to possibly refine its accuracy and produce useful metrics on the result. The bootstrap is a useful statistical technique and is leveraged in Bagging (bootstrap aggregation) algorithms such as Random...Show More
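
A minimal sketch of the resampling idea, assuming the metric of interest is the mean and using a simple percentile interval. The function name, the number of resamples, the fixed seed, and the sample data are all assumptions made for illustration.

```python
import random
import statistics


def bootstrap_mean_interval(data, n_resamples=2000, confidence=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `data`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    # Each resample draws len(data) points from the data WITH replacement.
    means = sorted(
        statistics.mean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    tail = (1 - confidence) / 2
    lower = means[int(tail * n_resamples)]
    upper = means[int((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements.
sample = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 4.9]
low, high = bootstrap_mean_interval(sample)
print(low, high)
```

Bagging applies the same resampling step, but trains one model per resample instead of computing one statistic per resample.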

[MINI] Gini Coefficients

15:59 | Nov 18th, 2016

The Gini Coefficient (as it relates to decision trees) is one approach to determining the optimal decision to introduce which splits your dataset as part of a decision tree. To pick the right feature to split on, it considers the frequency of the val...Show More
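
In the decision tree setting this measure is usually called Gini impurity. The sketch below is an illustration under assumptions (function names and the tiny label sets are hypothetical): it scores a candidate split by the size-weighted impurity of the two groups it produces, where lower is better.

```python
from collections import Counter


def gini_impurity(labels):
    """1 minus the sum of squared class frequencies; 0 means a pure group."""
    total = len(labels)
    return 1 - sum((count / total) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Size-weighted Gini impurity of the two sides of a candidate split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini_impurity(left) + (len(right) / total) * gini_impurity(right)


mixed = ["yes", "yes", "no", "no"]
print(gini_impurity(mixed))                            # 0.5 for an even two-class mix
print(split_impurity(["yes", "yes"], ["no", "no"]))    # 0.0 for a perfect split
print(split_impurity(["yes", "no"], ["yes", "no"]))    # 0.5: this split tells us nothing
```

A tree builder would evaluate `split_impurity` for each candidate feature and pick the split with the lowest value.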

Unstructured Data for Finance

33:31 | Nov 11th, 2016

Financial analysis techniques for studying numeric, well structured data are very mature. While using unstructured data in finance is not necessarily a new idea, the area is still very greenfield. On this episode, Delia Rusu shares her thoughts on the...Show More

[MINI] AdaBoost

10:39 | Nov 4th, 2016

AdaBoost is a canonical example of the class of AnyBoost algorithms that create ensembles of weak learners. We discuss how a complex problem like predicting restaurant failure (which is surely caused by different problems in different situations) mig...Show More

Stealing Models from the Cloud

37:06 | Oct 28th, 2016

Platform as a service is a growing trend in data science where services like fraud analysis and face detection can be provided via APIs. Such services turn the actual model into a black box to the consumer. But can the model be reverse engineered? Fl...Show More

[MINI] Calculating Feature Importance

13:04 | Oct 21st, 2016

For machine learning models created with the random forest algorithm, there is no obvious diagnostic to inform you which features are more important in the output of the model. Some straightforward but useful techniques exist revolving around removin...Show More

NYC Bike Share Rebalancing

29:39 | Oct 14th, 2016

As cities provide bike sharing services, they must also plan for how to redistribute bicycles as they inevitably build up at more popular destination stations. In this episode, Hui Xiong talks about the solution he and his colleagues developed to re...Show More

[MINI] Random Forest

12:43 | Oct 7th, 2016

Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.

Election Predictions

21:44 | Sep 30th, 2016

Jo Hardin joins us this week to discuss the ASA's Election Prediction Contest. This is a competition aimed at forecasting the results of the upcoming US presidential election. More details are available in Jo's blog post found here. You ...Show More

[MINI] F1 Score

09:01 | Sep 23rd, 2016

The F1 score is a model diagnostic that combines precision and recall to provide a singular evaluation for model comparison.  In this episode we discuss how it applies to selecting an interior designer.
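
For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it from raw counts; the function name and the example counts are assumptions made for illustration, not figures from the episode.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)


# Hypothetical model: 8 correct positives, 2 false alarms, 4 misses.
score = f1_score(true_positives=8, false_positives=2, false_negatives=4)
print(round(score, 4))  # 0.7273
```

Because it is a harmonic mean, the score is pulled toward whichever of precision or recall is lower, so a model cannot score well by excelling at only one of them.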

Urban Congestion

35:19 | Sep 16th, 2016

Urban congestion affects every person living in a city of any reasonable size. Lewis Lehe joins us in this episode to share his work on downtown congestion pricing. We explore topics of how different pricing mechanisms affect congestion as well as ho...Show More

[MINI] Heteroskedasticity

08:57 | Sep 9th, 2016

Heteroskedasticity is a term used to describe a relationship between two variables which has unequal variance over the range.  For example, the variance in the length of a cat's tail almost certainly changes (grows) with age.  On the other hand, the ...Show More


34:38 | Sep 2nd, 2016

Our guest today is Michael Cuthbert, an associate professor of music at MIT and principal investigator of the Music21 project, which we focus our discussion on today. Music21 is a python library making analysis of music accessible and fun. It support...Show More

[MINI] Paxos

14:43 | Aug 26th, 2016

Paxos is a protocol for arriving at a consensus in a distributed computing system which accounts for unreliability of the nodes.  We discuss how this might be used in the real world in the event of a massive disaster.

Trusting Machine Learning Models with LIME

35:16 | Aug 19th, 2016

Machine learning models are often criticized for being black boxes. If a human cannot determine why the model arrives at the decision it made, there's good cause for skepticism. Classic inspection approaches to model interpretability are only useful ...Show More


12:55 | Aug 12th, 2016

Analysis of variance is a method used to evaluate differences between two or more groups.  It works by breaking down the total variance of the system into the between group variance and within group variance.  We discuss this method in the contex...Show More

Machine Learning on Images with Noisy Human-centric Labels

23:11 | Aug 5th, 2016

When humans describe images, they have a reporting bias, in that they report only what they consider important. Thus, in addition to considering whether something is present in an image, one should consider whether it is also relevant to the image bef...Show More

[MINI] Survival Analysis

14:20 | Jul 29th, 2016

Survival analysis techniques are useful for studying the longevity of groups of elements or individuals, taking into account time considerations and right censorship. This episode explores how survival analysis can describe marriages, in particular, ...Show More

Predictive Models on Random Data

36:32 | Jul 22nd, 2016

This week is an insightful discussion with Claudia Perlich about some situations in machine learning where models can be built, perhaps by well-intentioned practitioners, to appear to be highly predictive despite being trained on random data. Our dis...Show More

[MINI] Receiver Operating Characteristic (ROC) Curve

11:10 | Jul 15th, 2016

An ROC curve is a plot that compares the trade off of true positives and false positives of a binary classifier under different thresholds. The area under the curve (AUC) is useful in determining how discriminating a model is. Together, ROC and AUC a...Show More
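
One way to build intuition for AUC is an equivalent formulation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below uses that pairwise view instead of tracing the curve threshold by threshold. The function name and the sample scores are assumptions for illustration.

```python
def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores with their true labels.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
print(auc(labels, scores))  # 8 of the 9 pairs are ranked correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to a classifier that separates the two classes perfectly.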

Multiple Comparisons and Conversion Optimization

30:02 | Jul 8th, 2016

I'm joined by Chris Stucchio this week to discuss how deliberate or uninformed statistical practitioners can derive spurious and arbitrary results via multiple comparisons. We discuss p-hacking and a variety of other important lessons and tips for pr...Show More

[MINI] Leakage

12:00 | Jul 1st, 2016

If you'd like to make a good prediction, your best bet is to invent a time machine, visit the future, observe the value, and return to the past. For those without access to time travel technology, we need to avoid including information about the futu...Show More

Predictive Policing

36:01 | Jun 24th, 2016

Kristian Lum (@KLdivergence) joins me this week to discuss her work at @hrdag on predictive policing. We also discuss Multiple Systems Estimation, a technique for inferring statistical information about a population from separate sources of observati...Show More

[MINI] The CAP Theorem

10:32 | Jun 17th, 2016

Distributed computing cannot simultaneously guarantee consistency, availability, and partition tolerance. Most system architects need to think carefully about how they should appropriately balance the needs of their application across these competing objectives. Linh D...Show More

Detecting Terrorists with Facial Recognition?

33:10 | Jun 10th, 2016

A startup is claiming that they can detect terrorists purely through facial recognition. In this solo episode, Kyle explores the plausibility of these claims.

[MINI] Goodhart's Law

10:56 | Jun 3rd, 2016

Goodhart's law states that "When a measure becomes a target, it ceases to be a good measure". In this mini-episode we discuss how this affects SEO, call centers, and Scrum.

Data Science at eHarmony

42:43 | May 27th, 2016

I'm joined this week by Jon Morra, director of data science at eHarmony to discuss a variety of ways in which machine learning and data science are being applied to help connect people for successful long term relationships. Interesting open source p...Show More

[MINI] Stationarity and Differencing

13:38 | May 20th, 2016

Mystery shoppers and fruit cultivation help us discuss stationarity - a property of some time series that are invariant to time in several ways. Differencing is one approach that can often convert a non-stationary process into a stationary one. If ...Show More


23:04 | May 13th, 2016

I'm joined by Wes McKinney (@wesmckinn) and Hadley Wickham (@hadleywickham) on this episode to discuss their joint project Feather. Feather is a file format for storing data frames along with some metadata, to help with interoperability between langu...Show More

[MINI] Bargaining

15:03 | May 6th, 2016

Bargaining is the process of two (or more) parties attempting to agree on the price for a transaction.  Game theoretic approaches attempt to find two strategies from which neither party is motivated to deviate.  These strategies are said to be in equ...Show More


29:53 | Apr 29th, 2016

Deepjazz is a project from Ji-Sung Kim, a computer science student at Princeton University. It is built using Theano, Keras, music21, and Evan Chow's project jazzml. Deepjazz is a computational music project that creates original jazz compositions us...Show More

[MINI] Auto-correlative functions and correlograms

14:58 | Apr 22nd, 2016

When working with time series data, there are a number of important diagnostics one should consider to help understand more about the data. The auto-correlative function, plotted as a correlogram, helps explain how a given observation relates to rec...Show More

Early Identification of Violent Criminal Gang Members

27:05 | Apr 15th, 2016

This week I spoke with Elham Shaabani and Paulo Shakarian (@PauloShakASU) about their recent paper Early Identification of Violent Criminal Gang Members (also available on arXiv). In this paper, they use social network analysis techniques and machine ...Show More

[MINI] Fractional Factorial Design

11:09 | Apr 8th, 2016

A dinner party at Data Skeptic HQ helps teach the uses of fractional factorial design for studying 2-way interactions.

Machine Learning Done Wrong

25:21 | Apr 1st, 2016

Cheng-tao Chu (@chengtao_chu) joins us this week to discuss his perspective on common mistakes and pitfalls that are made when doing machine learning. This episode is filled with sage advice for beginners and intermediate users of machine learning, a...Show More


41:22 | Mar 25th, 2016

Co-host Linh Da was in a biking accident after hitting a pothole. She sustained an injury that required stitches. This is the story of our quest to file a 311 complaint and track it through the City of Los Angeles's open data portal. My guests this ...Show More

[MINI] The Elbow Method

15:14 | Mar 18th, 2016

Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user defined parameter k. A user of these algorithms is required to select this value, which raises the questions: what is the "best" value of k that one ...Show More

Too Good to be True

35:11 | Mar 11th, 2016

Today on Data Skeptic, Lachlan Gunn joins us to discuss his recent paper Too Good to be True. This paper highlights a somewhat paradoxical / counterintuitive fact about how unanimity is unexpected in cases where perfect measurements cannot be taken. ...Show More

[MINI] R-squared

13:20 | Mar 4th, 2016

How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why ...Show More
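
As a small worked illustration, R-squared can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name and the house prices below (in thousands) are hypothetical values chosen for this sketch.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` that the predictions account for."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


# Hypothetical sale prices and model estimates, in thousands.
sale_prices = [200, 250, 300, 350]
estimates = [210, 240, 310, 340]
print(r_squared(sale_prices, estimates))  # about 0.968
```

A model that always guesses the average price scores 0, and a model that predicts every price exactly scores 1.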

Models of Mental Simulation

39:44 | Feb 26th, 2016

Jessica Hamrick joins us this week to discuss her work studying mental simulation. Her research combines machine learning approaches with behavioral methods from cognitive science to help explain how people reason and predict outcomes. Her recent pape...Show More

[MINI] Multiple Regression

18:29 | Feb 19th, 2016

This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and squ...Show More

Scientific Studies of People's Relationship to Music

42:14 | Feb 12th, 2016

Samuel Mehr joins us this week to share his perspective on why people are musical, where music comes from, and why it works the way it does. We discuss a number of empirical studies related to music and musical cognition, and dispel a few myths abo...Show More

[MINI] k-d trees

14:11 | Feb 5th, 2016

This episode reviews the concept of k-d trees: an efficient data structure for holding multidimensional objects. Kyle gives Linhda a dictionary and asks her to look up words as a way of introducing the concept of binary search. We actually spend most...Show More

Auditing Algorithms

42:58 | Jan 29th, 2016

Algorithms are pervasive in our society and make thousands of automated decisions on our behalf every day. The possibility of digital discrimination is a very real threat, and it is very plausible for discrimination to occur accidentally (i.e. outsid...Show More

[MINI] The Bonferroni Correction

14:29 | Jan 22nd, 2016

Today's episode begins by asking how many left handed employees we should expect to be at a company before anyone should claim left handedness discrimination. If not lefties, let's consider eye color, hair color, favorite ska band, most recent grocer...Show More

Detecting Pseudo-profound BS

37:37 | Jan 15th, 2016

A recent paper in the journal of Judgment and Decision Making titled On the reception and detection of pseudo-profound bullshit explores empirical questions around a reader's ability to detect statements which may sound profound but are actually a co...Show More

[MINI] Gradient Descent

14:51 | Jan 8th, 2016

Today's mini episode discusses the widely known optimization algorithm gradient descent in the context of hiking in a foggy hillside.
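
The foggy hillside analogy translates to code fairly directly: you cannot see the whole landscape, so you feel for the local slope and take a small step downhill, over and over. The sketch below assumes a simple one-dimensional "hill", f(x) = (x - 3)^2, and the step size and iteration count are arbitrary choices for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)  # move downhill by a small amount
    return x


def slope(x):
    """Derivative of the hypothetical hill f(x) = (x - 3)**2."""
    return 2 * (x - 3)


lowest_point = gradient_descent(slope, start=10.0)
print(lowest_point)  # very close to 3, the bottom of the valley
```

With a step size that is too large the hiker overshoots the valley, and with one that is too small the descent takes far longer, so the learning rate is the main knob to tune.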

Let's Kill the Word Cloud

15:03 | Jan 1st, 2016

This episode is a discussion of data visualization and a proposed New Year's resolution for Data Skeptic listeners. Let's kill the word cloud.

2015 Holiday Special

14:22 | Dec 25th, 2015

Today's episode is a reading of Isaac Asimov's The Machine that Won the War. I can't think of a story that's more appropriate for Data Skeptic.

Wikipedia Revision Scoring as a Service

42:56 | Dec 18th, 2015

In this interview with Aaron Halfaker of the Wikimedia Foundation, we discuss his research and career related to the study of Wikipedia. In his paper The Rise and Decline of an open Collaboration Community, he highlights a trend in the declining rate...Show More

[MINI] Term Frequency - Inverse Document Frequency

10:17 | Dec 11th, 2015

Today's topic is term frequency inverse document frequency, which is a statistic for estimating the importance of words and phrases in a set of documents.
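
A minimal sketch of one common TF-IDF variant, assuming term frequency is a word's share of the document and inverse document frequency is the natural log of (number of documents / number of documents containing the word). Other weightings exist; the function name and the toy documents are assumptions for illustration.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document (each document is a list of tokens)."""
    n_docs = len(documents)
    # Document frequency: how many documents contain each term at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

A word like "the" appears in every document, so its score is zero, while "mat" appears in only one document and scores highest there.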

The Hunt for Vulcan

41:31 | Dec 4th, 2015

Early astronomers could see several of the planets with the naked eye. The invention of the telescope allowed for further understanding of our solar system. The work of Isaac Newton allowed later scientists to accurately predict Neptune, which was la...Show More

[MINI] The Accuracy Paradox

17:04 | Nov 27th, 2015

Today's episode discusses the accuracy paradox. There are cases when one might prefer a less accurate model because it yields more predictive power or better captures the underlying causal factors describing the outcome variable you are interested in...Show More

Neuroscience from a Data Scientist's Perspective

40:18 | Nov 20th, 2015

... or should this have been called data science from a neuroscientist's perspective? Either way, I'm sure you'll enjoy this discussion with Laurie Skelly. Laurie earned a PhD in Integrative Neuroscience from the Department of Psychology at the Unive...Show More

[MINI] Bias Variance Tradeoff

13:35 | Nov 13th, 2015

A discussion of the expected number of cars at a stoplight frames today's discussion of the bias variance tradeoff. The central idea of this concept relates to model complexity. A very simple model will likely generalize well from training to testin...Show More

Big Data Doesn't Exist

32:28 | Nov 6th, 2015

The recent opinion piece Big Data Doesn't Exist on Tech Crunch by Slater Victoroff is an interesting discussion about the usefulness of data both big and small. Slater joins me this episode to discuss and expand on this discussion. Slater Victoroff ...Show More

[MINI] Covariance and Correlation

14:29 | Oct 30th, 2015

The degree to which two variables change together can be calculated in the form of their covariance. This value can be normalized to the correlation coefficient, which has the advantage of transforming it to a unitless measure strictly bounded betwee...Show More
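
The normalization step mentioned above can be sketched as follows: the (sample) covariance is divided by the product of the two standard deviations, giving the Pearson correlation coefficient. Function names and the sample data are assumptions for illustration.

```python
import math


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to be unitless."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]  # ys = 2 * xs + 1, a perfect linear relationship
print(covariance(xs, ys))   # depends on the units of xs and ys
print(correlation(xs, ys))  # approximately 1.0 regardless of units
```

Rescaling `ys` (say, from metres to centimetres) changes the covariance but leaves the correlation untouched.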

Bayesian A/B Testing

30:11 | Oct 23rd, 2015

Today's guest is Cameron Davidson-Pilon. Cameron has a master's degree in quantitative finance from the University of Waterloo. Think of it as statistics on stock markets. For the last two years he's been the team lead of data science at Shopify. He's...Show More

[MINI] The Central Limit Theorem

13:07 | Oct 16th, 2015

The central limit theorem is an important statistical result which states that typically, the mean of a large enough set of independent trials is approximately normally distributed.  This episode explores how this might be used to determine if an ama...Show More

Accessible Technology

38:44 | Oct 9th, 2015

Today's guest is Chris Hofstader (@gonz_blinko), an accessibility researcher and advocate, as well as an activist for causes such as improving access to information for blind and vision impaired people. His background in computer programming enabled ...Show More

[MINI] Multi-armed Bandit Problems

12:47 | Oct 2nd, 2015

The multi-armed bandit problem is named with reference to slot machines (one armed bandits). Given the chance to play from a pool of slot machines, all with unknown payout frequencies, how can you maximize your reward? If you knew in advance which ma...Show More

Shakespeare, Abiogenesis, and Exoplanets

58:14 | Sep 25th, 2015

Our episode this week begins with a correction. Back in episode 28 (Monkeys on Typewriters), Kyle made some bold claims about the probability that monkeys banging on typewriters might produce the entire works of Shakespeare by chance. The proof shown...Show More

[MINI] Sample Sizes

13:22 | Sep 18th, 2015

There are several factors that are important to selecting an appropriate sample size and dealing with small samples. The most important questions are around representativeness - how well does your sample represent the total population and capture all...Show More

The Model Complexity Myth

30:01 | Sep 11th, 2015

There's an old adage which says you cannot fit a model which has more parameters than you have data. While this is often the case, it's not a universal truth. Today's guest Jake VanderPlas explains this topic in detail and provides some excellent exa...Show More

[MINI] Distance Measures

12:44 | Sep 4th, 2015

There are many occasions in which one might want to know the distance or similarity between two things, for which the means of calculating that distance is not necessarily clear. The distance between two points in Euclidean space is generally straigh...Show More


53:11 | Aug 28th, 2015

ContentMine is a project which provides the tools and workflow to convert scientific literature into machine readable and machine interpretable data in order to facilitate better and more effective access to the accumulated knowledge of human kind. T...Show More

[MINI] Structured and Unstructured Data

13:20 | Aug 21st, 2015

Today's mini-episode explains the distinction between structured and unstructured data, and debates which of these categories best describe recipes.

Measuring the Influence of Fashion Designers

24:42 | Aug 14th, 2015

Yusan Lin shares her research on using data science to explore the fashion industry in this episode. She has applied techniques from data mining, natural language processing, and social network analysis to explore who are the innovators in the fashio...Show More

[MINI] PageRank

08:29 | Aug 7th, 2015

PageRank is the algorithm most famous for being one of the original innovations that made Google stand out as a search engine. It was defined in the classic paper The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Larry Pa...Show More

Data Science at Work in LA County

41:26 | Jul 29th, 2015

In this episode, Benjamin Uminsky enlightens us about some of the ways the Los Angeles County Registrar-Recorder/County Clerk leverages data science and analysis to help be more effective and efficient with the services and expectations they provide ...Show More

[MINI] k-Nearest Neighbors

08:33 | Jul 24th, 2015

This episode explores the k-nearest neighbors algorithm which is a supervised, non-parametric method that can be used for both classification and regression. The basic concept is that it leverages some distance function on your dataset to find th...Show More


1:24:42 | Jul 17th, 2015

How do people think rationally about small probability events? What is the optimal statistical process by which one can update their beliefs in light of new evidence? This episode of Data Skeptic explores questions like this as Kyle consults a cast o...Show More

[MINI] MapReduce

12:48 | Jul 10th, 2015

This mini-episode is a high level explanation of the basic idea behind MapReduce, which is a fundamental concept in big data. The origin of the idea comes from a Google paper titled MapReduce: Simplified Data Processing on Large Clusters. This episod...Show More

Genetically Engineered Food and Trends in Herbicide Usage

34:56 | Jul 3rd, 2015

The Credible Hulk joins me in this episode to discuss a recent blog post he wrote about glyphosate and the data about how its introduction changed the historical usage trends of other herbicides. Links to all the sources and references can be found ...Show More

[MINI] The Curse of Dimensionality

10:57 | Jun 26th, 2015

More features are not always better! With an increasing number of features to consider, machine learning algorithms suffer from the curse of dimensionality, as they have a wider set and often sparser coverage of examples to consider. This episode exp...Show More

Video Game Analytics

31:00 | Jun 19th, 2015

This episode discusses video game analytics with guest Anders Drachen. The way in which people get access to games and the opportunity for game designers to ask interesting questions with data has changed quite a bit in the last two decades. Anders s...Show More

[MINI] Anscombe's Quartet

09:07 | Jun 12th, 2015

This mini-episode discusses Anscombe's Quartet, a series of four datasets which are clearly very different but share some similar statistical properties with one another. For example, each of the four plots has the same mean and variance on both axis...Show More

Proposing Annoyance Mining

30:49 | Jun 9th, 2015

A recent episode of the Skeptics Guide to the Universe included a slight rant by Dr. Novella and the rogues about a shortcoming in operating systems.  This episode explores why such a (seemingly obvious) flaw might make sense from an engineering pers...Show More

Preserving History at Cyark

23:19 | Jun 5th, 2015

Elizabeth Lee from CyArk joins us in this episode to share stories of the work done capturing important historical sites digitally. CyArk is a non-profit focused on using technology to preserve the world's important historic and cultural locations di...Show More

[MINI] A Critical Examination of a Study of Marriage by Political Affiliation

10:24 | May 29th, 2015

Linhda and Kyle review a New York Times article titled How Your Hometown Affects Your Chances of Marriage. This article explores research about what correlates with the likelihood of being married by age 26 by county. Kyle and LinhDa discuss some of ...Show More

Detecting Cheating in Chess

44:35 | May 22nd, 2015

With the advent of algorithms capable of beating highly ranked chess players, the temptation to cheat has emerged as a potential threat to the integrity of this ancient and complex game. Yet, there are aspects of computer play that are measurably di...Show More

[MINI] z-scores

10:26 | May 15th, 2015

This week's episode discusses z-scores, also known as standard scores. This score describes the distance (in standard deviations) that an observation is away from the mean of the population. A closely related topic is the 68-95-99.7 rule which tells us t...Show More
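
A short sketch of the calculation, assuming the data are treated as the whole population (so the population standard deviation is used). The function name and the sample values are assumptions for illustration.

```python
import statistics


def z_scores(values):
    """Distance of each value from the mean, in population standard deviations."""
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)  # population standard deviation
    return [(v - mean) / std_dev for v in values]


# Hypothetical population with mean 5 and standard deviation 2.
observations = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(observations)
print(scores)  # the value 9 sits 2 standard deviations above the mean
```

For roughly normal data, a z-score beyond about 2 in either direction is unusual, since about 95% of observations fall within two standard deviations of the mean.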

Using Data to Help Those in Crisis

34:47 | May 8th, 2015

This week Noelle Sio Saldana discusses her volunteer work at Crisis Text Line - a 24/7 service that connects anyone with crisis counselors. In the episode we discuss Noelle's career and how, as a participant in the Pivotal for Good program (a partner...Show More

The Ghost in the MP3

35:22 | May 1st, 2015

Have you ever wondered what is lost when you compress a song into an MP3? This week's guest Ryan Maguire did more than that. He worked on software to isolate the sounds that are lost when you convert a lossless digital audio recording into a compres...

Data Fest 2015

27:23 | Apr 28th, 2015

This episode contains coverage of the 2015 Data Fest hosted at UCLA.  Data Fest is an analysis competition that gives teams of students 48 hours to explore a new dataset and present novel findings.  This year, data from was provided, and...

[MINI] Cornbread and Overdispersion

15:47 | Apr 24th, 2015

For our 50th episode we indulge a bit by cooking Linhda's previously mentioned "healthy" cornbread.  This leads to a discussion of the statistical topic of overdispersion in which the variance of some distribution is larger than what one's underlyin...

[MINI] Natural Language Processing

13:27 | Apr 17th, 2015

This episode overviews some of the fundamental concepts of natural language processing including stemming, n-grams, part-of-speech tagging, and the bag of words approach.

Computer-based Personality Judgments

31:56 | Apr 10th, 2015

Guest Youyou Wu discusses the work she and her collaborators did to measure the accuracy of computer-based personality judgments. Using Facebook "like" data, they found that machine learning approaches could be used to estimate users' self-assessment ...

[MINI] Markov Chain Monte Carlo

15:50 | Apr 3rd, 2015

This episode explores how going wine tasting could teach us about using Markov chain Monte Carlo (MCMC).

[MINI] Markov Chains

11:29 | Mar 20th, 2015

This episode introduces the idea of a Markov Chain. A Markov Chain has a set of states describing a particular system, and a probability of moving from one state to another along every valid connected state. Markov Chains are memoryless, meaning they...
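
As a rough illustration of the idea, a Markov chain can be written as a table of transition probabilities plus a step function that looks only at the current state. The two-state weather model, its probabilities, and the names TRANSITIONS, step and simulate are all invented for this sketch and do not come from the episode.

```python
import random

# Hypothetical two-state system; each row of probabilities sums to 1.
TRANSITIONS = {
    "sunny": {"sunny": 0.9, "rainy": 0.1},
    "rainy": {"sunny": 0.5, "rainy": 0.5},
}

def step(state, rng):
    """Pick the next state using only the current state (memorylessness)."""
    next_states = list(TRANSITIONS[state])
    weights = [TRANSITIONS[state][s] for s in next_states]
    return rng.choices(next_states, weights=weights, k=1)[0]

def simulate(start, n_steps, seed=0):
    """Return the list of visited states, beginning with `start`."""
    rng = random.Random(seed)
    path = [start]
    for _ in range(n_steps):
        path.append(step(path[-1], rng))
    return path

print(simulate("sunny", 10))
```

Note that step never receives the earlier history, which is what makes the process memoryless.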

Oceanography and Data Science

33:15 | Mar 13th, 2015

Nicole Goebel joins us this week to share her experiences in oceanography studying phytoplankton and other aspects of the ocean and how data plays a role in that science.  We also discuss Thinkful where Nicole and I are both mentors for the Introd...

[MINI] Ordinary Least Squares Regression

18:07 | Mar 6th, 2015

This episode explores Ordinary Least Squares or OLS - a method for finding a good fit which describes a given dataset.

NYC Speed Camera Analysis with Tim Schmeier

16:56 | Feb 27th, 2015

New York State approved the use of automated speed cameras within a specific range of schools. Tim Schmeier did an analysis of publicly available data related to these cameras as part of a project at the NYC Data Science Academy. Tim's work leverag...

[MINI] k-means clustering

14:20 | Feb 20th, 2015

The k-means clustering algorithm is an algorithm that computes a deterministic label for a given "k" number of clusters from an n-dimensional dataset.  This mini-episode explores how the biological processes of Yoshi, our lilac-crowned amazon, might be a us...

Shadow Profiles on Social Networks

38:37 | Feb 13th, 2015

Emre Sarigol joins me this week to discuss his paper Online Privacy as a Collective Phenomenon. This paper studies data collected from social networks and how the sharing behaviors of individuals can unintentionally reveal private information about o...

[MINI] The Chi-Squared Test

17:32 | Feb 6th, 2015

The χ² (Chi-Squared) test is a methodology for hypothesis testing. When one has categorical data, in the form of frequency counts or observations (e.g. Vegetarian, Pescetarian, and Omnivore), split into two or more categories (e.g. Male, Female), a q...

Mapping Reddit Topics with Randy Olson

29:57 | Jan 30th, 2015

My guest this week is noteworthy AI researcher Randy Olson who joins me to share his work creating the Reddit World Map - a visualization that illuminates clusters in the Reddit community based on user behavior. Randy's blog post on creating the re...

[MINI] Partially Observable State Spaces

12:45 | Jan 23rd, 2015

When dealing with dynamic systems that are potentially undergoing constant change, it's helpful to describe what "state" they are in.  In many applications the manner in which the state changes from one to another is not completely predictable, thus, ...

Easily Fooling Deep Neural Networks

28:25 | Jan 16th, 2015

My guest this week is Anh Nguyen, a PhD student at the University of Wyoming working in the Evolving AI lab. The episode discusses the paper Deep Neural Networks are Easily Fooled by Anh Nguyen, Jason Yosinski, and Jeff Clune. It describes a pr...

[MINI] Data Provenance

10:56 | Jan 9th, 2015

This episode introduces a high-level discussion on the topic of Data Provenance, with more MINI episodes to follow to get into specific topics. Thanks to listener Sara L who wrote in to point out the Data Skeptic Podcast has focused a lot on using ...

Doubtful News, Geology, Investigating Paranormal Groups, and Thinking Scientifically with Sharon Hill

31:28 | Jan 3rd, 2015

I had the chance to speak with well-known Sharon Hill (@idoubtit) for the first episode of 2015. We discuss a number of interesting topics including the contributions Doubtful News makes to getting scientific and skeptical information ranked highly i...

[MINI] Belief in Santa

09:55 | Dec 26th, 2014

In this quick holiday episode, we touch on how one would approach modeling the statistical distribution over the probability of belief in Santa Claus given age.

Economic Modeling and Prediction, Charitable Giving, and a Follow Up with Peter Backus

23:43 | Dec 19th, 2014

Economist Peter Backus joins me in this episode to discuss a few interesting topics. You may recall Linhda and I previously discussed his paper "The Girlfriend Equation" on a recent mini-episode. We start by touching base on this fun paper and get a ...

[MINI] The Battle of the Sexes

18:04 | Dec 12th, 2014

Love and Data is the continued theme in this mini-episode as we discuss the game theory example of The Battle of the Sexes. In this textbook example, a couple must strategize about how to spend their Friday night. One partner prefers football games w...

The Science of Online Data at Plenty of Fish with Thomas Levi

58:46 | Dec 5th, 2014

Can algorithms help you find love? Many happy couples successfully brought together via online dating websites show us that data science can help you find love. I'm joined this week by Thomas Levi, Senior Data Scientist at Plenty of Fish, to discuss ...

[MINI] The Girlfriend Equation

16:11 | Nov 28th, 2014

Economist Peter Backus put forward "The Girlfriend Equation" while working on his PhD - a probabilistic model attempting to estimate the likelihood of him finding a girlfriend. In this mini episode we explore the soundness of his model and also share...

The Secret and the Global Consciousness Project with Alex Boklin

41:45 | Nov 21st, 2014

I'm joined this week by Alex Boklin to explore the topic of magical thinking especially in the context of Rhonda Byrne's "The Secret", and the similarities it bears to The Global Consciousness Project (GCP). The GCP puts forward the hypothesis that r...

[MINI] Monkeys on Typewriters

03:05 | Nov 14th, 2014

What is randomness? How can we determine if some results are randomly generated or not? Why are random numbers important to us in our everyday life? These topics and more are discussed in this mini-episode on random numbers. Many readers will be vag...

Mining the Social Web with Matthew Russell

50:19 | Nov 7th, 2014

This week's episode explores the possibilities of extracting novel insights from the many great social web APIs available. Matthew Russell's Mining the Social Web is a fantastic exploration of the tools and methods, and we explore a few related topic...

[MINI] Is the Internet Secure?

26:11 | Oct 31st, 2014

This episode explores the basis of why we can trust encryption.  Surprisingly, a discussion of looking up a word in the dictionary (binary search) and efficiently going wine tasting (the travelling salesman problem) helps introduce computational comple...

Practicing and Communicating Data Science with Jeff Stanton

36:57 | Oct 24th, 2014

Jeff Stanton joins me in this episode to discuss his book An Introduction to Data Science, and some of the unique challenges and issues faced by someone doing applied data science. A challenge to any data scientist is making sure they have a good inp...

[MINI] The T-Test

17:03 | Oct 17th, 2014

The t-test is this week's mini-episode topic. The t-test is a statistical testing procedure used to determine if the means of two datasets differ by a statistically significant amount. We discuss how a wine manufacturer might apply a t-test to determ...

Data Myths with Karl Mamer

48:29 | Oct 10th, 2014

This week I'm joined by Karl Mamer to discuss the data behind three well-known urban legends. Did a large blackout in New York and surrounding areas result in a baby boom nine months later? Do subliminal messages affect our behavior? Is placing beer ...

Contest Announcement

12:18 | Oct 8th, 2014

The Data Skeptic Podcast is launching a contest - not one of chance, but one of skill. Listeners are encouraged to put their data science skills to good use, or if all else fails, guess! The contest works as follows. Below is some data about the cumu...

[MINI] Selection Bias

14:31 | Oct 3rd, 2014

A discussion about conducting US presidential election polls helps frame a conversation about selection bias.

[MINI] Confidence Intervals

11:30 | Sep 26th, 2014

Commute times and BBQ invites help frame a discussion about the statistical concept of confidence intervals.

[MINI] Value of Information

14:10 | Sep 19th, 2014

A discussion about getting ready in the morning, negotiating a used car purchase, and selecting the best AirBnB place to stay at helps frame a conversation about the decision-theoretic principle known as the Value of Information equation.

Game Science Dice with Louis Zocchi

47:28 | Sep 17th, 2014

In this bonus episode, guest Louis Zocchi discusses his background in the gaming industry, specifically, how he became a manufacturer of dice designed to produce statistically uniform outcomes. During the show Louis mentioned a two-part video listene...

Data Science at ZestFinance with Marick Sinay

31:25 | Sep 12th, 2014

Marick Sinay from ZestFinance is our guest this week.  This episode explores how data science techniques are applied in the financial world, specifically in assessing creditworthiness.

[MINI] Decision Tree Learning

13:29 | Sep 5th, 2014

Linhda and Kyle talk about Decision Tree Learning in this mini-episode.  Decision Tree Learning is the algorithmic process of trying to generate an optimal decision tree to properly classify or forecast some future unlabeled element based by following...

Jackson Pollock Authentication Analysis with Kate Jones-Smith

49:49 | Aug 29th, 2014

Our guest this week is Hamilton physics professor Kate Jones-Smith who joins us to discuss the evidence for the claim that drip paintings of Jackson Pollock contain fractal patterns. This hypothesis originates in a paper by Taylor, Micolich, and Jona...

[MINI] Noise!!

16:04 | Aug 22nd, 2014

Our topic for this week is "noise" as in signal vs. noise.  This is not a signal processing discussion, but rather a brief introduction to how the word noise is used to describe how much information in a dataset is useless (as opposed to useful). A...

Guerrilla Skepticism on Wikipedia with Susan Gerbic

1:09:59 | Aug 15th, 2014

Our guest this week is Susan Gerbic. Susan is a skeptical activist involved in many activities, the one we focus on most in this episode is Guerrilla Skepticism on Wikipedia, an organization working to improve the content and citations of Wikipedia. ...

[MINI] Ant Colony Optimization

15:07 | Aug 8th, 2014

In this week's mini episode, Linhda and Kyle discuss Ant Colony Optimization - a numerical / stochastic optimization technique which models its search after the process ants employ in using random walks to find a goal (food) and then leaving a pherem...

Data in Healthcare IT with Shahid Shah

57:14 | Aug 1st, 2014

Our guest this week is Shahid Shah. Shahid is CEO at Netspective, and writes three blogs: Health Care Guy, Shahid Shah, and HitSphere - the Healthcare IT Supersite. During the program, Kyle recommended a talk from the 2014 MIT Sloan CIO Symposium en...

[MINI] Cross Validation

0:00 | Jul 25th, 2014

This mini-episode discusses the technique called Cross Validation - a process by which one randomly divides up a dataset into numerous small partitions. Next, (typically) one is held out, and the rest are used to train some model. The held-out set can...
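
The partition-and-hold-out procedure described above might look roughly like this in Python. It is a minimal sketch under stated assumptions: the helper name k_fold_splits is hypothetical, it works on row indices rather than real data, and model training is left out entirely.

```python
import random

def k_fold_splits(n_items, k, seed=0):
    """Yield (train_indices, held_out_indices) pairs for k-fold cross validation."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and the number of items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)        # random division of the dataset
    folds = [indices[i::k] for i in range(k)]   # k roughly equal partitions
    for i, held_out in enumerate(folds):
        # Everything outside the held-out fold is used for training.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, held_out

for train, held_out in k_fold_splits(n_items=10, k=5):
    print("train on", sorted(train), "evaluate on", sorted(held_out))
```

Each index lands in exactly one held-out fold, so every observation is used for evaluation once and for training k - 1 times.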

Streetlight Outage and Crime Rate Analysis with Zach Seeskin

33:29 | Jul 18th, 2014

This episode features a discussion with statistics PhD student Zach Seeskin about a project he was involved in as part of the Eric and Wendy Schmidt Data Science for Social Good Summer Fellowship.  The project involved exploring the relationship (if ...

[MINI] Experimental Design

15:43 | Jul 11th, 2014

This episode loosely explores the topic of Experimental Design including hypothesis testing, the importance of statistical tests, and an everyday and business example.

The Right (big data) Tool for the Job with Jay Shankar

49:59 | Jul 7th, 2014

In this week's episode, we discuss applied solutions to big data problems with big data engineer Jay Shankar.  The episode explores approaches and design philosophy to solving real-world big data business problems, and the exploration of the wide arra...

[MINI] Bayesian Updating

11:24 | Jun 27th, 2014

In this minisode, we discuss Bayesian Updating - the process by which one can calculate how likely a hypothesis is to be true given one's older / prior belief and all new evidence.
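
For a single yes/no hypothesis, the update follows directly from Bayes' rule. The sketch below is illustrative only: the function name bayes_update and the numbers are assumptions, not taken from the episode.

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Return P(hypothesis | evidence) via Bayes' rule.

    prior               -- P(hypothesis) before seeing the evidence
    likelihood_if_true  -- P(evidence | hypothesis is true)
    likelihood_if_false -- P(evidence | hypothesis is false)
    """
    evidence = likelihood_if_true * prior + likelihood_if_false * (1 - prior)
    if evidence == 0:
        raise ValueError("the evidence has zero probability under both cases")
    return likelihood_if_true * prior / evidence

# Made-up numbers: start undecided, then see two independent observations,
# each four times as likely if the hypothesis is true.
belief = 0.5
for _ in range(2):
    belief = bayes_update(belief, likelihood_if_true=0.8, likelihood_if_false=0.2)
    print(round(belief, 4))
```

Each posterior becomes the prior for the next piece of evidence, which is the sense in which the belief is updated.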

Personalized Medicine with Niki Athanasiadou

57:14 | Jun 20th, 2014

In the second full length episode of the podcast, we discuss the current state of personalized medicine and the advancements in genetics that have made it possible.

[MINI] p-values

16:36 | Jun 13th, 2014

In this mini, we discuss p-values and their use in hypothesis testing, in the context of a hypothetical experiment on plant flowering, and end with a reference to the Particle Fever documentary and how statistical significance played a role.

Advertising Attribution with Nathan Janos

1:16:29 | Jun 6th, 2014

A conversation with Convertro's Nathan Janos about methodologies used to help advertisers understand the effect each of their marketing efforts (print, SEM, display, skywriting, etc.) has on their overall return.

[MINI] type i / type ii errors

11:01 | May 30th, 2014

In this first mini-episode of the Data Skeptic Podcast, we define and discuss Type I and Type II errors (a.k.a. false positives and false negatives).


03:56 | May 23rd, 2014

The Data Skeptic Podcast features conversations on topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to evaluate the ...