Science

Data Skeptic

Kyle Polich

The Data Skeptic Podcast features interviews and discussion of topics related to data science, statistics, machine learning, artificial intelligence and the like, all from the perspective of applying critical thinking and the scientific method to eva...Show More

42:59 | Feb 16th, 2018

Making a decision is a complex task. Today's guest Dongho Kim discusses how he and his team at Prowler have been building a platform that will be accessible by way of APIs and a set of pre-made scripts for autonomous decision making based on probabili...Show More

33:17 | Dec 29th, 2017

This episode kicks off the next theme on Data Skeptic: artificial intelligence.  Kyle discusses what's to come for the show in 2018, why this topic is relevant, and how we intend to cover it.

38:48 | Nov 17th, 2017

In this week's episode, host Kyle Polich interviews author Lance Fortnow about whether P will ever be equal to NP and solve all of life’s problems. Fortnow begins the discussion with the example question: Are there 100 people on Facebook who are all ...Show More

18:44 | Oct 13th, 2017

How long an algorithm takes to run depends on many factors including implementation details and hardware.  However, the formal analysis of algorithms focuses on how they will perform in the worst case as the input size grows.  We refer to an algorith...Show More

17:03 | Aug 4th, 2017

A Bayesian Belief Network is an acyclic directed graph composed of nodes that represent random variables and edges that imply a conditional dependence between them. It's an intuitive way of encoding your statistical knowledge about a system and is ef...Show More

18:01 | Sep 16th

Kyle pontificates on how impressed he is with BERT.

21:51 | Sep 6th

Kyle sits down with Jen Stirrup to inquire about her experiences helping companies deploy data science solutions in a variety of different settings.

22:38 | Aug 19th

Video annotation is an expensive and time-consuming process. As a consequence, the available video datasets are useful but small. The availability of machine transcribed explainer videos offers a unique opportunity to rapidly develop a useful, if dir...Show More

13:44 | Jul 29th

Kyle provides a non-technical overview of why Bidirectional Encoder Representations from Transformers (BERT) is a powerful tool for natural language processing projects.

20:32 | Jul 22nd

Kyle interviews Prasanth Pulavarthi about the ONNX format for deep neural networks.

21:27 | Jul 15th

Kyle and Linhda discuss some high-level theory of mind and give an overview of the machine learning concept of catastrophic forgetting.

29:51 | Jul 8th

Sebastian Ruder is a research scientist at DeepMind.  In this episode, he joins us to discuss the state of the art in transfer learning and his contributions to it.

23:08 | Jun 21st

In 2017, Facebook published a paper called Deal or No Deal? End-to-End Learning for Negotiation Dialogues. In this research, the reinforcement learning agents developed a mechanism of communication (which could be called a language) that made them ab...Show More

16:47 | Jun 15th

Priyanka Biswas joins us in this episode to discuss natural language processing for languages that do not have as many resources as those that are more commonly studied such as English.  Successful NLP projects benefit from the availability of like l...Show More

17:12 | Jun 8th

Kyle and Linh Da discuss the class of approaches called "Named Entity Recognition" or NER.  NER algorithms take any string as input and return a list of "entities" - specific facts and agents in the text along with a classification of the type (e.g. ...Show More

20:19 | Jun 1st

USC students from the CAIS++ student organization have created a variety of novel projects under the mission statement of "artificial intelligence for social good". In this episode, Kyle interviews Zane and Leena about the Endangered Languages Projec...Show More

25:27 | May 25th

Kyle and Linh Da discuss the concepts behind the neural Turing machine.

30:05 | May 18th

Kyle chats with Rohan Kumar about hyperscale, data at the edge, and a variety of other trends in data engineering in the cloud.

23:53 | May 11th

In this episode, Kyle interviews Laura Edell at MS Build 2019.  The conversation covers a number of topics, notably her NCAA Final 4 prediction model.

15:23 | May 3rd

Kyle and Linhda discuss attention and the transformer - an encoder/decoder architecture that extends the basic ideas of vector embeddings like word2vec into a more contextual use case.

25:20 | Apr 26th

When users on Twitter post with geographic tags, it creates the opportunity for a variety of interesting questions to be posed having to do with language, dialects, and location.  In this episode, Kyle interviews Bruno Gonçalves about his work studyi...Show More

27:28 | Apr 20th

This is an interview with Ellen Loeshelle, Director of Product Management at Clarabridge.  We primarily discuss sentiment analysis.

14:51 | Apr 13th

A gentle introduction to the very high-level idea of "attention" in machine learning, as it will play a major role in some upcoming episodes over the next few weeks.

24:43 | Apr 5th

Modern messaging technology has facilitated a trend towards highly compact, short messages sent by users who can presume a great amount of context held between the communicating parties.  The rules of grammar may be discarded and often visible errors...Show More

23:49 | Mar 29th

ELMo (Embeddings from Language Models) introduced the idea of deep contextualized word representations. It extends previous ideas like word2vec and GloVe. The ELMo model is a neural network able to map natural language into a vector space. This vecto...Show More

42:23 | Mar 23rd

Bilingual evaluation understudy (or BLEU) is a metric for evaluating the quality of machine translation using human translation as examples of acceptable quality results. This metric has become a widely used standard in the research literature. But i...Show More

24:10 | Mar 15th

While at NeurIPS 2018, Kyle chatted with Liang Huang about his work with Baidu research on simultaneous translation, which was demoed at the conference.

32:43 | Mar 8th

Machine transcription (the process of translating audio recordings of language to text) has come a long way in recent years. But how do the errors made during machine transcription compare to the errors made by a human transcriber? Find out in this e...Show More

21:41 | Mar 1st

A sequence to sequence (or seq2seq) model is a neural architecture used for translation (and other tasks) which consists of an encoder and a decoder. The encoder/decoder architecture has obvious promise for machine translation, and has been successfull...Show More

20:28 | Feb 22nd

Kyle interviews Julia Silge about her path into data science, her book Text Mining with R, and some of the ways in which she's used natural language processing in projects both personal and professional. Related Links https://stack-survey-2018.glitc...Show More

19:13 | Feb 15th

One of the most challenging NLP tasks is natural language understanding and reasoning. How can we construct algorithms that are able to achieve human level understanding of text and be able to answer general questions about it? This is truly an open ...Show More

39:07 | Feb 8th

In the first half of this episode, Kyle speaks with Marc-Alexandre Côté and Wendy Tay about Text World. Text World is an engine that simulates text adventure games. Developers are encouraged to try out their reinforcement learning skills building a...Show More

31:27 | Feb 1st

Word2vec is an unsupervised machine learning model which is able to capture semantic information from the text it is trained on. The model is based on neural networks. Several large organizations like Google and Facebook have trained word embeddings ...Show More

50:37 | Jan 25th

In a recent paper, Leveraging Discourse Information Effectively for Authorship Attribution, authors Su Wang, Elisa Ferracane, and Raymond J. Mooney describe a deep learning methodology for predicting which of a collection of authors was the author of a ...Show More

24:11 | Jan 18th

The earliest efforts to apply machine learning to natural language tended to convert every token (every word, more or less) into a unique feature. While techniques like stemming may have cut the number of unique tokens down, researchers always had to...Show More

34:57 | Jan 11th

Github is many things besides source control. It's a social network, even though not everyone realizes it. It's a vast repository of code. It's a ticketing and project management system. And of course, it has search as well. In this episode, Kyle int...Show More

36:13 | Jan 4th

This episode reboots our podcast with the theme of Natural Language Processing for the next few months. We begin with introductions of Yoshi and Linh Da and then get into a broad discussion about natural language processing: what it is, what some of ...Show More

33:05 | Dec 28th, 2018

Kyle shares a few thoughts on mistakes he has observed job applicants make, along with a few procedural insights that listeners at early stages in their careers might find valuable.

28:59 | Dec 21st, 2018

In today's episode, Kyle chats with Alexander Zhebrak, CTO of Insilico Medicine, Inc. Insilico describes itself as artificial intelligence for drug discovery, biomarker development, and aging research. The conversation in this episode explores the ways...Show More

19:46 | Dec 14th, 2018

At the NeurIPS 2018 conference, Stradigi AI premiered a training game which helps players learn American Sign Language. This episode brings the first of many interviews conducted at NeurIPS 2018. In this episode, Kyle interviews Chief Data Scientist ...Show More

19:51 | Dec 7th, 2018

This week, Kyle interviews Scott Nestler on the topic of Data Ethics. Today, no ubiquitous, formal ethical protocol exists for data science, although some have been proposed. One example is the INFORMS Ethics Guidelines. Guidelines like this are rath...Show More

33:49 | Nov 30th, 2018

Kyle interviews Mick West, author of Escaping the Rabbit Hole: How to Debunk Conspiracy Theories Using Facts, Logic, and Respect about the nature of conspiracy theories, the people that believe them, and how to help people escape the belief in false ...Show More

18:59 | Nov 23rd, 2018

Fake news attempts to lead readers/listeners/viewers to conclusions that are not descriptions of reality.  They do this most often by presenting false premises, but sometimes by presenting flawed logic. An argument is only sound and valid if the conc...Show More

31:48 | Nov 16th, 2018

Fake news can be responded to with fact-checking. However, it's easier to create fake news than to fact-check it. Full Fact is the UK's independent fact-checking organization. In this episode, Kyle interviews Mevan Babakar, head of automated fact-ch...Show More

29:30 | Nov 9th, 2018

In mathematics, truth is universal.  In data, truth lies in the where clause of the query. As large organizations have grown to rely on their data more significantly for decision making, a common problem is not being able to agree on what the data is...Show More

44:51 | Nov 2nd, 2018

Fast radio bursts are an astrophysical phenomenon first observed in 2007. While many observations have been made, science has yet to explain the mechanism for these events. This has led some to ask: could it be a form of extra-terrestrial communicati...Show More

24:38 | Oct 26th, 2018

This episode explores the root concept of what it is to be Bayesian: describing knowledge of a system probabilistically, having an appropriate prior probability, knowing how to weigh new evidence, and following Bayes's rule to compute the revised distri...Show More

33:12 | Oct 19th, 2018

This is our interview with Dorje Brody about his recent paper with David Meier, How to model fake news. This paper uses the tools of communication theory and a sub-topic called filtering theory to describe the mathematical basis for an information ch...Show More

26:47 | Oct 12th, 2018

Without getting into definitions, we have an intuitive sense of what a "community" is. The Louvain Method for Community Detection is one of the best known mathematical techniques designed to detect communities. This method requires typical graph data...Show More

31:48 | Oct 5th, 2018

In this episode, our guest Dan Kahan discusses his research into how people consume and interpret science news. In an era of fake news, motivated reasoning, and alternative facts, important questions need to be asked about how people understand new in...Show More

25:46 | Sep 28th, 2018

A false discovery rate (FDR) is a methodology that can be useful when struggling with the problem of multiple comparisons. In any experiment, if the experimenter checks more than one dependent variable, then they are making multiple comparisons. Natu...Show More

30:23 | Sep 21st, 2018

Digital videos can be described as sequences of still images and associated audio. Audio is easy to fake. What about video? A video can easily be broken down into a sequence of still images replayed rapidly in sequence. In this context, videos are si...Show More

19:19 | Sep 14th, 2018

In this episode, Kyle reviews what we've learned so far in our series on Fake News and talks briefly about where we're going next.

18:55 | Sep 7th, 2018

Two weeks ago we discussed click through rates or CTRs and their usefulness and limits as a metric. Today, we discuss a related metric known as quality score. While that phrase has probably been used to mean dozens of different things in different co...Show More

40:01 | Aug 31st, 2018

Kyle interviews Steven Sloman, Professor in the school of Cognitive, Linguistic, and Psychological Sciences at Brown University. Steven is co-author of The Knowledge Illusion: Why We Never Think Alone and Causal Models: How People Think about the Wor...Show More

31:45 | Aug 24th, 2018

A Click Through Rate (CTR) is the proportion of clicks to impressions of some item of content shared online. This terminology is most commonly used in digital advertising but applies just as well to content websites might choose to feature on their h...Show More
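
As a minimal sketch of the definition above (clicks divided by impressions): the function name and the choice to return 0.0 when there are no impressions are assumptions made for illustration, not something taken from the episode.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Return clicks / impressions for one item of content.

    Assumption: an item with zero impressions is reported as 0.0
    rather than raising, since the ratio is undefined in that case.
    """
    if clicks < 0 or impressions < 0:
        raise ValueError("clicks and impressions must be non-negative")
    if impressions == 0:
        return 0.0
    return clicks / impressions


if __name__ == "__main__":
    # Hypothetical numbers: 5 clicks out of 200 impressions.
    print(f"CTR: {click_through_rate(5, 200):.3%}")
```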

46:26 | Aug 17th, 2018

The scale and frequency with which information can be distributed on social media makes the problem of fake news a rapidly metastasizing issue. To do any content filtering or labeling demands an algorithmic solution. In today's episode, Kyle intervie...Show More

28:17 | Aug 10th, 2018

If you prepared a list of creatures regarded as highly intelligent, it's unlikely ants would make the cut. This is expected, as on an individual level, ants do not generally display behavior that most humans would regard as intelligence. In fact, it ...Show More

28:27 | Aug 3rd, 2018

With publications such as "Prior exposure increases perceived accuracy of fake news", "Lazy, not biased: Susceptibility to partisan fake news is better explained by lack of reasoning than by motivated reasoning", and "The science of fake news", Gordo...Show More

19:45 | Jul 27th, 2018

Today's spam filters are advanced data driven tools. They rely on a variety of techniques to effectively and often seamlessly filter out junk email from good email. Whitelists, blacklists, traffic analysis, network analysis, and a variety of other to...Show More

45:18 | Jul 20th, 2018

How does fake news get spread online? It's not just a matter of manipulating search algorithms. The social platforms for sharing play a major role in the distribution of fake news. But how significant of an impact can there be? How significantly can b...Show More

38:19 | Jul 13th, 2018

This episode kicks off our new theme of "Fake News" with guests Robert Sheaffer and Brad Schwartz. Fake news is a new label for an old idea. For our purposes, we will define fake news as information created to deliberately mislead while masquerading as ...Show More

38:20 | Jul 11th, 2018

We revisit the 2018 Microsoft Build in this episode, focusing on the latest ideas in DevOps. Kyle interviews Cloud Developer Advocates Damien Brady, Paige Bailey, and Donovan Brown to talk about DevOps and data science and databases. For a data scien...Show More

16:51 | Jul 6th, 2018

Logic is a fundamental of mathematical systems. Its roots are the values true and false and its power is in what its rules allow you to prove. Propositional logic provides its user variables. This episode gets into First Order Logic, an extension...Show More

27:35 | Jun 29th, 2018

An intelligent agent trained in a simulated environment may be prone to making mistakes in the real world due to discrepancies between the training and real-world conditions. The areas where an agent makes mistakes are hard to find, known as "blind s...Show More

31:29 | Jun 22nd, 2018

In this week’s episode, our host Kyle interviews Gokula Krishnan from ETH Zurich, about his recent contributions to defenses against adversarial attacks. The discussion centers around his latest paper, titled “Defending Against Adversarial Attacks by...Show More

18:04 | Jun 15th, 2018

On a long car ride, Linhda and Kyle record a short episode. This discussion is about transfer learning, a technique used in machine learning to leverage training from one domain to have a head start learning in another domain. Transfer learning has ...Show More

25:21 | Jun 8th, 2018

Medical imaging is a highly effective tool used by clinicians to diagnose a wide array of diseases and injuries. However, it often requires exceptionally trained specialists such as radiologists to interpret accurately. In this episode of Data Skepti...Show More

21:32 | Jun 1st, 2018

Thanks to our sponsor Galvanize. A Kalman Filter is a technique for taking a sequence of observations about an object or variable and determining the most likely current state of that object. In this episode, we discuss it in the context of tracking o...Show More

43:03 | May 25th, 2018

There's so much to discuss on the AI side, it's hard to know where to begin. Luckily,  Steve Guggenheimer, Microsoft’s corporate vice president of AI Business, and Carlos Pessoa, a software engineering manager for the company’s Cloud AI Platform, tal...Show More

25:58 | May 18th, 2018

Today's interview is with the authors of the textbook Artificial Intelligence and Games.

24:11 | May 11th, 2018

Thanks to our sponsor The Great Courses. This week's episode is a short primer on game theory. For tickets to the free Data Skeptic meetup in Chicago on Tuesday, May 15 at the Mendoza College of Business (224 South Michigan Avenue, Suite 350), click ...Show More

27:32 | May 4th, 2018

In this episode of Data Skeptic, Kyle chats with Jerry Schwarz from the Independent Investigations Group (IIG)'s SF Bay Area chapter about testing claims of the paranormal. The IIG is a volunteer-based organization dedicated to investigating paranorm...Show More

36:57 | Apr 27th, 2018

Our guest this week, Hector Levesque, joins us to discuss an alternative way to measure a machine’s intelligence, called the Winograd Schema Challenge. The challenge was proposed as a possible alternative to the Turing test during the 2011 AAAI Spring S...Show More

1:00:58 | Apr 20th, 2018

This week on Data Skeptic, we begin with a skit to introduce the topic of this show: The Imitation Game. We open with a scene in the distant future. The year is 2027, and a company called Shamony is announcing their new product, Ada, the most advance...Show More

17:15 | Apr 13th, 2018

In this episode, Kyle shares his perspective on the chatbot Eugene Goostman which (some claim) "passed" the Turing Test. As a second topic Kyle also does an intro of the Winograd Schema Challenge.

23:44 | Apr 6th, 2018

In this episode, Kyle and Linhda discuss the theory of formal languages. Any language can (theoretically) be a formal language. The requirement is that the language can be rigorously described as a set of strings which are considered part of the lang...Show More

33:21 | Mar 30th, 2018

The Loebner Prize is a competition in the spirit of the Turing Test.  Participants are welcome to submit conversational agent software to be judged by a panel of humans.  This episode includes interviews with Charlie Maloney, a judge in the Loebner P...Show More

27:05 | Mar 23rd, 2018

In this episode, Kyle chats with Vince from iv.ai and Heather Shapiro who works on the Microsoft Bot Framework. We solicit their advice on building a good chatbot both creatively and technically. Our sponsor today is Warby Parker.

46:34 | Mar 16th, 2018

In this week’s episode, Kyle Polich interviews Pedro Domingos about his book, The Master Algorithm: How the quest for the ultimate learning machine will remake our world. In the book, Domingos describes what machine learning is doing for humanity, ho...Show More

27:25 | Mar 9th, 2018

What's the best machine learning algorithm to use? I hear that XGBoost wins most of the Kaggle competitions that aren't won with deep learning. Should I just use XGBoost all the time? That might work out most of the time in practice, but a proof exis...Show More

38:34 | Mar 2nd, 2018

For a long time, physicians have recognized that the tools they have aren't powerful enough to treat complex diseases, like cancer. In addition to data science and models, clinicians also needed actual products: tools that physicians and researchers...Show More

18:40 | Feb 23rd, 2018

In a previous episode, we discussed Markov Decision Processes or MDPs, a framework for decision making and planning. This episode explores the generalization Partially Observable MDPs (POMDPs) which are an incredibly general framework that describes ...Show More

23:03 | Feb 9th, 2018

In many real world situations, a person/agent doesn't necessarily know their own objectives or the mechanics of the world they're interacting with. However, if the agent receives rewards which are correlated with both their actions and the state ...Show More

24:44 | Feb 2nd, 2018

In this week’s episode, Kyle is joined by Risto Miikkulainen, a professor of computer science and neuroscience at the University of Texas at Austin. They talk about evolutionary computation, its applications in deep learning, and how it’s inspired by...Show More

20:24 | Jan 26th, 2018

Formally, an MDP is defined as the tuple containing states, actions, the transition function, and the reward function. This podcast examines each of these and presents them in the context of simple examples.  Despite MDPs suffering from the curse of...Show More

29:06 | Jan 19th, 2018

Last week on Data Skeptic, we visited the Laboratory of Neuroimaging, or LONI, at USC and learned about their data-driven platform that enables scientists from all over the world to share, transform, store, manage and analyze their data to understand...Show More

26:37 | Jan 12th, 2018

Last year, Kyle had a chance to visit the Laboratory of Neuroimaging, or LONI, at USC, and learn about how some researchers are using data science to study the function of the brain. We’re going to be covering some of their work in two episodes on Da...Show More

17:21 | Jan 5th, 2018

In artificial intelligence, the term 'agent' is used to mean an autonomous, thinking agent with the ability to interact with their environment. An agent could be a person or a piece of software. In either case, we can describe aspects of the agent in...Show More

12:38 | Dec 22nd, 2017

We break format from our regular programming today and bring you an excerpt from Max Tegmark's book "Life 3.0".  The first chapter is a short story titled "The Tale of the Omega Team".  Audio excerpted courtesy of Penguin Random House Audio from LIFE...Show More

35:53 | Dec 15th, 2017

This week, our host Kyle Polich is joined by guest Tim Henderson from Google to talk about the computational complexity foundations of modern cryptography and the complexity issues that underlie the field. A key question that arises during the discus...Show More

27:05 | Dec 14th, 2017

This episode features an interview with Rigel Smiroldo recorded at NIPS 2017 in Long Beach, California.  We discuss data privacy, machine learning use cases, model deployment, and end-to-end machine learning.

20:37 | Dec 8th, 2017

When computers became commodity hardware and storage became incredibly cheap, we entered the era of so-called "big" data. Most definitions of big data will include something about not being able to process all the data on a single machine. Distributed ...Show More

47:49 | Dec 1st, 2017

In this week's episode, Scott Aaronson, a professor at the University of Texas at Austin, explains what a quantum computer is, various possible applications, the types of problems they are good at solving and much more. Kyle and Scott have a lively d...Show More

28:27 | Nov 28th, 2017

I sat down with Ali Ghodsi, CEO and founder of Databricks, and John Chirapurath, GM for Data Platform Marketing at Microsoft, to discuss the recent announcement of Azure Databricks. When I heard about the announcement, my first thoughts were two-fold.  ...Show More

15:55 | Nov 24th, 2017

In this episode we discuss the complexity class of EXP-Time which contains algorithms which require $O(2^{p(n)})$ time to run.  In other words, the worst case runtime is exponential in some polynomial of the input size.  Problems in this class are ev...Show More

18:29 | Nov 10th, 2017

Algorithms with similar runtimes are said to be in the same complexity class. That runtime is measured in how many steps an algorithm takes relative to the input size. The class P contains all algorithms which run in polynomial time (basically, a...Show More

47:31 | Nov 3rd, 2017

In this episode, Professor Michael Kearns from the University of Pennsylvania joins host Kyle Polich to talk about the computational complexity of machine learning, complexity in game theory, and algorithmic fairness. Michael's doctoral thesis gave a...Show More

13:54 | Oct 27th, 2017

TMs are a model of computation at the heart of algorithmic analysis.  A Turing Machine has two components.  An infinitely long piece of tape (memory) with re-writable squares and a read/write head which is programmed to change its state as it proces...Show More

38:51 | Oct 20th, 2017

Over the past several years, we have seen many success stories in machine learning brought about by deep learning techniques. While the practical success of deep learning has been phenomenal, the formal guarantees have been lacking. Our current theor...Show More

31:40 | Oct 6th, 2017

In this episode, Microsoft's Corporate Vice President for Cloud Artificial Intelligence, Joseph Sirosh, joins host Kyle Polich to share some of Microsoft's latest and most exciting innovations in AI development platforms. Last month, Microsoft la...Show More

34:33 | Sep 29th, 2017

Last year, the film development and production company End Cue produced a short film, called Sunspring, that was entirely written by an artificial intelligence using neural networks. More specifically, it was authored by a recurrent neural network (R...Show More

17:39 | Sep 22nd, 2017

One Shot Learning is the class of machine learning procedures that focuses on learning something from a small number of examples.  This is in contrast to "traditional" machine learning which typically requires a very large training set to build a reason...Show More

46:09 | Sep 15th, 2017

Recommender systems play an important role in providing personalized content to online users. Yet, typical data mining techniques are not well suited for the unique challenges that recommender systems face. In this episode, host Kyle Polich joins Dr....Show More

15:29 | Sep 8th, 2017

A Long Short Term Memory (LSTM) is a neural unit, often used in a Recurrent Neural Network (RNN), which attempts to provide the network the capacity to store information for longer periods of time. An LSTM unit remembers values for either long or short t...Show More

37:11 | Sep 1st, 2017

Zillow is a leading real estate information and home-related marketplace. We interviewed Andrew Martin, a data science Research Manager at Zillow, to learn more about how Zillow uses data science and big data to make real estate predictions.

32:05 | Aug 25th, 2017

Our guest Pranav Rajpurkar and his coauthors recently published Cardiologist-Level Arrhythmia Detection with Convolutional Neural Networks, a paper in which they demonstrate the use of Convolutional Neural Networks which outperform board certified c...Show More

17:06 | Aug 18th, 2017

RNNs are a class of deep learning models designed to capture sequential behavior.  An RNN trains a set of weights which depend not just on new input but also on the previous state of the neural network.  This directed cycle allows the training phase ...Show More

31:14 | Aug 11th, 2017

Thanks to our sponsor Springboard. In this week's episode, guest Andre Natal from Mozilla joins our host, Kyle Polich, to discuss a couple exciting new developments in open source speech recognition systems, which include Project Common Voice. In Jun...Show More

26:59 | Jul 28th, 2017

In this episode, Tony Beltramelli of UIzard Technologies joins our host, Kyle Polich, to talk about the ideas behind his latest app that can transform graphic design into functioning code, as well as his previous work on spying with wearables.

14:43 | Jul 21st, 2017

In statistics, two random variables might depend on one another (for example, interest rates and new home purchases). We call this conditional dependence. An important related concept exists called conditional independence. This phrase describes situ...Show More

27:05 | Jul 14th, 2017

Animals can't tell us when they're experiencing pain, so we have to rely on other cues to help treat their discomfort. But it is often difficult to tell how much an animal is suffering. The sheep, for instance, is the most inscrutable of animals. How...Show More

33:33 | Jul 7th, 2017

This episode collects interviews from my recent trip to Microsoft Build where I had the opportunity to speak with Dharma Shukla and Syam Nair about the recently announced CosmosDB. CosmosDB is a globally consistent, distributed datastore that support...Show More

15:16 | Jun 30th, 2017

This episode discusses the vanishing gradient - a problem that arises when training deep neural networks in which nearly all the gradients are very close to zero by the time back-propagation has reached the first hidden layer. This makes learning vir...Show More

41:50 | Jun 23rd, 2017

When faced with medical issues, would you want to be seen by a human or a machine? In this episode, guest Edward Choi, co-author of the study titled Doctor AI: Predicting Clinical Events via Recurrent Neural Network shares his thoughts. Edward present...Show More

14:11 | Jun 16th, 2017

In a neural network, the output value of a neuron is almost always transformed in some way using a function. A trivial choice would be a linear transformation which can only scale the data. However, other transformations, like a step function allow f...Show More

27:37 | Jun 9th, 2017

This episode recaps the Microsoft Build Conference.  Kyle recently attended and shares some thoughts on cloud, databases, cognitive services, and artificial intelligence.  The episode includes interviews with Rohan Kumar and David Carmona.

12:33 | Jun 2nd, 2017

Max-pooling is a procedure in a neural network which has several benefits. It performs dimensionality reduction by taking a collection of neurons and reducing them to a single value for future layers to receive as input. It can also prevent overfitti...Show More
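
The dimensionality reduction described above can be sketched in plain Python. This is an illustrative sketch only: the function name, the 2x2 window, and the use of non-overlapping windows (stride equal to the window size) are assumptions, and real networks would use a deep learning library rather than nested lists.

```python
def max_pool_2d(grid, size=2):
    """Reduce each non-overlapping size x size window of `grid` to its maximum.

    Assumptions: `grid` is a rectangular list of lists of numbers, the
    stride equals the window size, and any ragged edge that does not fill
    a whole window is dropped.
    """
    rows, cols = len(grid), len(grid[0])
    return [
        [
            max(grid[r + dr][c + dc] for dr in range(size) for dc in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


if __name__ == "__main__":
    # Hypothetical 4x4 activation map, pooled down to 2x2.
    activations = [
        [1, 3, 2, 0],
        [4, 2, 1, 5],
        [0, 1, 9, 8],
        [2, 6, 7, 3],
    ]
    print(max_pool_2d(activations))
```

Each group of four values is replaced by one, so a later layer receives a quarter as many inputs.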

23:43 | May 26th, 2017

This episode is an interview with Tinghui Zhou.  In the recent paper "Unsupervised Learning of Depth and Ego-motion from Video", Tinghui and collaborators propose a deep learning architecture which is able to learn depth and pose information from unl...Show More

14:54 | May 19th, 2017

CNNs are characterized by their use of a group of neurons typically referred to as a filter or kernel.  In image recognition, this kernel is repeated over the entire image.  In this way, CNNs may achieve the property of translational invariance - onc...Show More

29:19 | May 12th, 2017

Despite the success of GANs in imaging, one of their major drawbacks is the problem of 'mode collapse,' where the generator learns to produce samples with extremely low variety. To address this issue, today's guests Arnab Ghosh and Viveka Kulharia prop...Show More

09:51 | May 5th, 2017

GANs are an unsupervised learning method involving two neural networks iteratively competing. The discriminator is a typical learning system. It attempts to develop the ability to recognize members of a certain class, such as all photos which have bi...Show More

52:59 | Apr 28th, 2017

Recently, we've seen opinion polls come under some skepticism. But is that skepticism truly justified? The recent Brexit referendum and US 2016 Presidential Election are examples where some claim the polls "got it wrong". This episode explores th...Show More

26:17 | Apr 21st, 2017

No reliable, complete database cataloging home sales data at a transaction level is available for the average person to access. For data scientists interested in studying this data, our hands are completely tied. Opportunities like testing sociologica...Show More

11:03 | Apr 14th, 2017

There's more than one type of computer processor. The central processing unit (CPU) is typically what one means when they say "processor". GPUs were introduced to be highly optimized for doing floating point computations in parallel. These types of o...Show More

15:13 | Apr 7th, 2017

Backpropagation is a common algorithm for training a neural network. It works by computing the gradient of the overall error with respect to each weight, and using stochastic gradient descent to iteratively fine-tune the weights of the network. In ...Show More

32:23 | Mar 31st, 2017

In this week's episode of Data Skeptic, host Kyle Polich talks with guest Maura Church, Patreon's data science manager. Patreon is a fast-growing crowdfunding platform that allows artists and creators of all kinds to build their own subscription content...Show More

15:58 | Mar 24th, 2017

In a feed forward neural network, neurons cannot form a cycle. In this episode, we explore how such a network would be able to represent three common logical operators: OR, AND, and XOR. The XOR operation is the interesti...Show More

41:31 | Mar 17th, 2017

In this Data Skeptic episode, Kyle is joined by guest Ruggiero Cavallo to discuss his latest efforts to mitigate the problems presented in this new world of online advertising. Working with his collaborators, Ruggiero reconsiders the search ad alloca...Show More

14:46 | Mar 10th, 2017

Today's episode overviews the perceptron algorithm. This rather simple approach is characterized by a few particular features. It updates its weights after seeing every example, rather than as a batch. It uses a step function as an activation functio...Show More

24:35 | Mar 3rd, 2017

DataRefuge is a public collaborative, grassroots effort around the United States in which scientists, researchers, computer scientists, librarians and other volunteers are working to download, save, and re-upload government data. The DataRefuge Proje...Show More

16:14 | Feb 24th, 2017

If a CEO wants to know the state of their business, they ask their highest ranking executives. These executives, in turn, should know the state of the business through reports from their subordinates. This structure is roughly analogous to a process ...Show More

30:45 | Feb 17th, 2017

In this episode, I speak with Raghu Ramakrishnan, CTO for Data at Microsoft.  We discuss services, tools, and developments in the big data sphere as well as the underlying needs that drove these innovations.

14:28 | Feb 10th, 2017

In this episode, we talk about a high-level description of deep learning. Kyle presents a simple game (pictured below), which is more of a puzzle really, to try and give Linh Da the basic concept. Thanks to our sponsor for this week, the Dat...Show More

40:11 | Feb 3rd, 2017

Versioning isn't just for source code. Being able to track changes to data is critical for answering questions about data provenance, quality, and reproducibility. Daniel Whitenack joins me this week to talk about these concepts and share his work on...Show More

20:48 | Jan 27th, 2017

Logistic Regression is a popular classification algorithm. In this episode, we discuss how it can be used to determine if an audio clip represents one of two given speakers. It assumes an output variable (isLinhda) is a linear combination of availabl...Show More

34:27 | Jan 20th, 2017

Prior work has shown that people's response to competition is in part predicted by their gender. Understanding why and when this occurs is important in areas such as labor market outcomes. A well structured study is challenging due to numerous confou...Show More

15:55 | Jan 13th, 2017

Deep learning can be prone to overfit a given problem. This is especially frustrating given how much time and computational resources are often required to converge. One technique for fighting overfitting is to use dropout. Dropout is the method of r...Show More

49:17 | Jan 6th, 2017

In this episode I speak with Clarence Wardell and Kelly Jin about their mutual service as part of the White House's Police Data Initiative and Data Driven Justice Initiative respectively. The Police Data Initiative was organized to use open data to i...Show More

35:23 | Dec 30th, 2016

We close out 2016 with a discussion of a basic interview question which might get asked when applying for a data science job. Specifically, how a library might build a model to predict if a book will be returned late or not.

39:33 | Dec 23rd, 2016

Today's episode is a reading of Isaac Asimov's Franchise.  As mentioned on the show, this is just a work of fiction to be enjoyed and not in any way some obfuscated political statement.  Enjoy, and happy holidays!

16:36 | Dec 16th, 2016

Classically, entropy is a measure of disorder in a system. From a statistical perspective, it is more useful to say it's a measure of the unpredictability of the system. In this episode we discuss how information reduces the entropy in deciding wheth...Show More

42:23 | Dec 9th, 2016

Cloud services are now ubiquitous in data science and more broadly in technology as well. This week, I speak to Mark Souza, Tobias Ternström, and Corey Sanders about various aspects of data at scale. We discuss the embedding of R into SQLServer, SQLS...Show More

34:13 | Dec 2nd, 2016

Today's episode is all about Causal Impact, a technique for estimating the impact of a particular event on a time series. We talk to William Martin about his research into the impact releases have on apps, and we also chat with Karen Blakemore about a ...Show More

10:37 | Nov 25th, 2016

The Bootstrap is a method of resampling a dataset to possibly refine its accuracy and produce useful metrics on the result. The bootstrap is a useful statistical technique and is leveraged in Bagging (bootstrap aggregation) algorithms such as Random...Show More

15:59 | Nov 18th, 2016

The Gini Coefficient (as it relates to decision trees) is one approach to determining the optimal decision to introduce which splits your dataset as part of a decision tree. To pick the right feature to split on, it considers the frequency of the val...Show More

33:31 | Nov 11th, 2016

Financial analysis techniques for studying numeric, well structured data are very mature. While using unstructured data in finance is not necessarily a new idea, the area is still very greenfield. On this episode, Delia Rusu shares her thoughts on the...Show More

10:39 | Nov 4th, 2016

AdaBoost is a canonical example of the class of AnyBoost algorithms that create ensembles of weak learners. We discuss how a complex problem like predicting restaurant failure (which is surely caused by different problems in different situations) mig...Show More

37:06 | Oct 28th, 2016

Platform as a service is a growing trend in data science where services like fraud analysis and face detection can be provided via APIs. Such services turn the actual model into a black box to the consumer. But can the model be reverse engineered? Fl...Show More

13:04 | Oct 21st, 2016

For machine learning models created with the random forest algorithm, there is no obvious diagnostic to inform you which features are more important in the output of the model. Some straightforward but useful techniques exist revolving around removin...Show More

29:39 | Oct 14th, 2016

As cities provide bike sharing services, they must also plan for how to redistribute bicycles as they inevitably build up at more popular destination stations. In this episode, Hui Xiong talks about the solution he and his colleagues developed to re...Show More

12:43 | Oct 7th, 2016

Random forest is a popular ensemble learning algorithm which leverages bagging both for sampling and feature selection. In this episode we make an analogy to the process of running a bookstore.

21:44 | Sep 30th, 2016

Jo Hardin joins us this week to discuss the ASA's Election Prediction Contest. This is a competition aimed at forecasting the results of the upcoming US presidential election. More details are available in Jo's blog post found here. You ...Show More

09:01 | Sep 23rd, 2016

The F1 score is a model diagnostic that combines precision and recall to provide a singular evaluation for model comparison.  In this episode we discuss how it applies to selecting an interior designer.
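
As a rough illustration of how precision and recall combine into a single number, here is a minimal Python sketch. The function name `f1_score` and the count-based arguments are assumptions made for this example; they do not come from the episode.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Harmonic mean of precision and recall, computed from raw counts.

    Hypothetical helper for illustration; returns 0.0 when precision or
    recall is undefined (no predicted positives or no actual positives).
    """
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives
    if predicted_positive == 0 or actual_positive == 0:
        return 0.0

    precision = true_positives / predicted_positive  # how many picks were right
    recall = true_positives / actual_positive        # how many right ones were picked
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot score well by maximizing precision alone or recall alone.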

35:19 | Sep 16th, 2016

Urban congestion affects every person living in a city of any reasonable size. Lewis Lehe joins us in this episode to share his work on downtown congestion pricing. We explore topics of how different pricing mechanisms affect congestion as well as ho...Show More

08:57 | Sep 9th, 2016

Heteroskedasticity is a term used to describe a relationship between two variables which has unequal variance over the range.  For example, the variance in the length of a cat's tail almost certainly changes (grows) with age.  On the other hand, the ...Show More

34:38 | Sep 2nd, 2016

Our guest today is Michael Cuthbert, an associate professor of music at MIT and principal investigator of the Music21 project, which we focus our discussion on today. Music21 is a python library making analysis of music accessible and fun. It support...Show More

14:43 | Aug 26th, 2016

Paxos is a protocol for arriving at a consensus in a distributed computing system which accounts for the unreliability of the nodes. We discuss how this might be used in the real world in the event of a massive disaster.

35:16 | Aug 19th, 2016

Machine learning models are often criticized for being black boxes. If a human cannot determine why the model arrives at the decision it made, there's good cause for skepticism. Classic inspection approaches to model interpretability are only useful ...Show More

12:55 | Aug 12th, 2016

Analysis of variance is a method used to evaluate differences between two or more groups. It works by breaking down the total variance of the system into the between group variance and within group variance. We discuss this method in the contex...Show More

23:11 | Aug 5th, 2016

When humans describe images, they have a reporting bias, in that they report only what they consider important. Thus, in addition to considering whether something is present in an image, one should consider whether it is also relevant to the image bef...Show More

14:20 | Jul 29th, 2016

Survival analysis techniques are useful for studying the longevity of groups of elements or individuals, taking into account time considerations and right censoring. This episode explores how survival analysis can describe marriages, in particular, ...Show More

36:32 | Jul 22nd, 2016

This week is an insightful discussion with Claudia Perlich about some situations in machine learning where models can be built, perhaps by well-intentioned practitioners, to appear to be highly predictive despite being trained on random data. Our dis...Show More

11:10 | Jul 15th, 2016

An ROC curve is a plot that compares the trade off of true positives and false positives of a binary classifier under different thresholds. The area under the curve (AUC) is useful in determining how discriminating a model is. Together, ROC and AUC a...Show More

30:02 | Jul 8th, 2016

I'm joined by Chris Stucchio this week to discuss how deliberate or uninformed statistical practitioners can derive spurious and arbitrary results via multiple comparisons. We discuss p-hacking and a variety of other important lessons and tips for pr...Show More

12:00 | Jul 1st, 2016

If you'd like to make a good prediction, your best bet is to invent a time machine, visit the future, observe the value, and return to the past. For those without access to time travel technology, we need to avoid including information about the futu...Show More

36:01 | Jun 24th, 2016

Kristian Lum (@KLdivergence) joins me this week to discuss her work at @hrdag on predictive policing. We also discuss Multiple Systems Estimation, a technique for inferring statistical information about a population from separate sources of observati...Show More

10:32 | Jun 17th, 2016

Distributed computing cannot simultaneously guarantee consistency, availability, and partition tolerance. Most system architects need to think carefully about how they should appropriately balance the needs of their application across these competing objectives. Linh D...Show More

33:10 | Jun 10th, 2016

A startup is claiming that they can detect terrorists purely through facial recognition. In this solo episode, Kyle explores the plausibility of these claims.

10:56 | Jun 3rd, 2016

Goodhart's law states that "When a measure becomes a target, it ceases to be a good measure". In this mini-episode we discuss how this affects SEO, call centers, and Scrum.

42:43 | May 27th, 2016

I'm joined this week by Jon Morra, director of data science at eHarmony to discuss a variety of ways in which machine learning and data science are being applied to help connect people for successful long term relationships. Interesting open source p...Show More

13:38 | May 20th, 2016

Mystery shoppers and fruit cultivation help us discuss stationarity - a property of some time series that are invariant to time in several ways. Differencing is one approach that can often convert a non-stationary process into a stationary one. If ...Show More

23:04 | May 13th, 2016

I'm joined by Wes McKinney (@wesmckinn) and Hadley Wickham (@hadleywickham) on this episode to discuss their joint project Feather. Feather is a file format for storing data frames along with some metadata, to help with interoperability between langu...Show More

15:03 | May 6th, 2016

Bargaining is the process of two (or more) parties attempting to agree on the price for a transaction.  Game theoretic approaches attempt to find two strategies from which neither party is motivated to deviate.  These strategies are said to be in equ...Show More

29:53 | Apr 29th, 2016

Deepjazz is a project from Ji-Sung Kim, a computer science student at Princeton University. It is built using Theano, Keras, music21, and Evan Chow's project jazzml. Deepjazz is a computational music project that creates original jazz compositions us...Show More

14:58 | Apr 22nd, 2016

When working with time series data, there are a number of important diagnostics one should consider to help understand more about the data. The autocorrelation function, plotted as a correlogram, helps explain how a given observation relates to rec...Show More

27:05 | Apr 15th, 2016

This week I spoke with Elham Shaabani and Paulo Shakarian (@PauloShakASU) about their recent paper Early Identification of Violent Criminal Gang Members (also available on arXiv). In this paper, they use social network analysis techniques and machine ...Show More

11:09 | Apr 8th, 2016

A dinner party at Data Skeptic HQ helps teach the uses of fractional factorial design for studying 2-way interactions.

25:21 | Apr 1st, 2016

Cheng-tao Chu (@chengtao_chu) joins us this week to discuss his perspective on common mistakes and pitfalls that are made when doing machine learning. This episode is filled with sage advice for beginners and intermediate users of machine learning, a...Show More

41:22 | Mar 25th, 2016

Co-host Linh Da was in a biking accident after hitting a pothole. She sustained an injury that required stitches. This is the story of our quest to file a 311 complaint and track it through the City of Los Angeles's open data portal. My guests this ...Show More

15:14 | Mar 18th, 2016

Certain data mining algorithms (including k-means clustering and k-nearest neighbors) require a user defined parameter k. A user of these algorithms is required to select this value, which raises the questions: what is the "best" value of k that one ...Show More

35:11 | Mar 11th, 2016

Today on Data Skeptic, Lachlan Gunn joins us to discuss his recent paper Too Good to be True. This paper highlights a somewhat paradoxical / counterintuitive fact about how unanimity is unexpected in cases where perfect measurements cannot be taken. ...Show More

13:20 | Mar 4th, 2016

How well does your model explain your data? R-squared is a useful statistic for answering this question. In this episode we explore how it applies to the problem of valuing a house. Aspects like the number of bedrooms go a long way in explaining why ...Show More

39:44 | Feb 26th, 2016

Jessica Hamrick joins us this week to discuss her work studying mental simulation. Her research combines machine learning approaches with behavioral methods from cognitive science to help explain how people reason and predict outcomes. Her recent pape...Show More

18:29 | Feb 19th, 2016

This episode is a discussion of multiple regression: the use of observations that are a vector of values to predict a response variable. For this episode, we consider how features of a home such as the number of bedrooms, number of bathrooms, and squ...Show More

42:14 | Feb 12th, 2016

Samuel Mehr joins us this week to share his perspective on why people are musical, where music comes from, and why it works the way it does. We discuss a number of empirical studies related to music and musical cognition, and dispel a few myths abo...Show More

14:11 | Feb 5th, 2016

This episode reviews the concept of k-d trees: an efficient data structure for holding multidimensional objects. Kyle gives Linhda a dictionary and asks her to look up words as a way of introducing the concept of binary search. We actually spend most...Show More

42:58 | Jan 29th, 2016

Algorithms are pervasive in our society and make thousands of automated decisions on our behalf every day. The possibility of digital discrimination is a very real threat, and it is very plausible for discrimination to occur accidentally (i.e. outsid...Show More

14:29 | Jan 22nd, 2016

Today's episode begins by asking how many left handed employees we should expect to be at a company before anyone should claim left handedness discrimination. If not lefties, let's consider eye color, hair color, favorite ska band, most recent grocer...Show More

37:37 | Jan 15th, 2016

A recent paper in the journal of Judgment and Decision Making titled On the reception and detection of pseudo-profound bullshit explores empirical questions around a reader's ability to detect statements which may sound profound but are actually a co...Show More

14:51 | Jan 8th, 2016

Today's mini episode discusses the widely known optimization algorithm gradient descent in the context of hiking in a foggy hillside.

15:03 | Jan 1st, 2016

This episode is a discussion of data visualization and a proposed New Year's resolution for Data Skeptic listeners. Let's kill the word cloud.

14:22 | Dec 25th, 2015

Today's episode is a reading of Isaac Asimov's The Machine that Won the War. I can't think of a story that's more appropriate for Data Skeptic.

42:56 | Dec 18th, 2015

In this interview with Aaron Halfaker of the Wikimedia Foundation, we discuss his research and career related to the study of Wikipedia. In his paper The Rise and Decline of an open Collaboration Community, he highlights a trend in the declining rate...Show More

10:17 | Dec 11th, 2015

Today's topic is term frequency inverse document frequency, which is a statistic for estimating the importance of words and phrases in a set of documents.
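
To make the statistic concrete, here is a minimal Python sketch of one common TF-IDF variant. The function name `tf_idf`, the whitespace tokenization, and the specific weighting (relative term frequency times the natural log of total documents over documents containing the term) are assumptions for illustration; several other weightings are in use.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: score} dict per document.

    Illustrative variant only:
      tf  = term count / number of tokens in the document
      idf = ln(number of documents / number of documents containing the term)
    """
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores
```

Under this weighting, a term that appears in every document scores zero, while a term concentrated in a single document scores highest there.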

41:31 | Dec 4th, 2015

Early astronomers could see several of the planets with the naked eye. The invention of the telescope allowed for further understanding of our solar system. The work of Isaac Newton allowed later scientists to accurately predict Neptune, which was la...Show More

17:04 | Nov 27th, 2015

Today's episode discusses the accuracy paradox. There are cases when one might prefer a less accurate model because it yields more predictive power or better captures the underlying causal factors describing the outcome variable you are interested in...Show More

40:18 | Nov 20th, 2015

... or should this have been called data science from a neuroscientist's perspective? Either way, I'm sure you'll enjoy this discussion with Laurie Skelly. Laurie earned a PhD in Integrative Neuroscience from the Department of Psychology at the Unive...Show More

13:35 | Nov 13th, 2015

A discussion of the expected number of cars at a stoplight frames today's discussion of the bias variance tradeoff. The central idea of this concept relates to model complexity. A very simple model will likely generalize well from training to testin...Show More

32:28 | Nov 6th, 2015

The recent opinion piece Big Data Doesn't Exist on TechCrunch by Slater Victoroff is an interesting discussion about the usefulness of data both big and small. Slater joins me this episode to discuss and expand on this discussion. Slater Victoroff ...Show More

14:29 | Oct 30th, 2015

The degree to which two variables change together can be calculated in the form of their covariance. This value can be normalized to the correlation coefficient, which has the advantage of transforming it to a unitless measure strictly bounded betwee...Show More

30:11 | Oct 23rd, 2015

Today's guest is Cameron Davidson-Pilon. Cameron has a master's degree in quantitative finance from the University of Waterloo. Think of it as statistics on stock markets. For the last two years he's been the team lead of data science at Shopify. He's...Show More

13:07 | Oct 16th, 2015

The central limit theorem is an important statistical result which states that typically, the mean of a large enough set of independent trials is approximately normally distributed.  This episode explores how this might be used to determine if an ama...Show More

38:44 | Oct 9th, 2015

Today's guest is Chris Hofstader (@gonz_blinko), an accessibility researcher and advocate, as well as an activist for causes such as improving access to information for blind and vision impaired people. His background in computer programming enabled ...Show More

12:47 | Oct 2nd, 2015

The multi-armed bandit problem is named with reference to slot machines (one armed bandits). Given the chance to play from a pool of slot machines, all with unknown payout frequencies, how can you maximize your reward? If you knew in advance which ma...Show More

58:14 | Sep 25th, 2015

Our episode this week begins with a correction. Back in episode 28 (Monkeys on Typewriters), Kyle made some bold claims about the probability that monkeys banging on typewriters might produce the entire works of Shakespeare by chance. The proof shown...Show More

13:22 | Sep 18th, 2015

There are several factors that are important to selecting an appropriate sample size and dealing with small samples. The most important questions are around representativeness - how well does your sample represent the total population and capture all...Show More

30:01 | Sep 11th, 2015

There's an old adage which says you cannot fit a model which has more parameters than you have data. While this is often the case, it's not a universal truth. Today's guest Jake VanderPlas explains this topic in detail and provides some excellent exa...Show More

12:44 | Sep 4th, 2015

There are many occasions in which one might want to know the distance or similarity between two things, for which the means of calculating that distance is not necessarily clear. The distance between two points in Euclidean space is generally straigh...Show More

53:11 | Aug 28th, 2015

ContentMine is a project which provides the tools and workflow to convert scientific literature into machine readable and machine interpretable data in order to facilitate better and more effective access to the accumulated knowledge of humankind. T...Show More

13:20 | Aug 21st, 2015

Today's mini-episode explains the distinction between structured and unstructured data, and debates which of these categories best describe recipes.

24:42 | Aug 14th, 2015

Yusan Lin shares her research on using data science to explore the fashion industry in this episode. She has applied techniques from data mining, natural language processing, and social network analysis to explore who are the innovators in the fashio...Show More

08:29 | Aug 7th, 2015

PageRank is the algorithm most famous for being one of the original innovations that made Google stand out as a search engine. It was defined in the classic paper The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Larry Pa...Show More

41:26 | Jul 29th, 2015

In this episode, Benjamin Uminsky enlightens us about some of the ways the Los Angeles County Registrar-Recorder/County Clerk leverages data science and analysis to help be more effective and efficient with the services and expectations they provide ...Show More

08:33 | Jul 24th, 2015

This episode explores the k-nearest neighbors algorithm, which is a supervised, non-parametric method that can be used for both classification and regression. The basic concept is that it leverages some distance function on your dataset to find th...Show More