Software Engineering

Linear Digressions

Ben Jaffe and Katie Malone

In each episode, your hosts explore machine learning and data science through interesting (and often very unusual) applications.
Convolutional Neural Nets

21:55 | Apr 2nd, 2018

If you've done image recognition or computer vision tasks with a neural network, you've probably used a convolutional neural net. This episode is all about the architecture and implementation details of convolutional networks, and the tricks that mak...Show More
A Technical Deep Dive on Stanley, the First Self-Driving Car

41:32 | Aug 12th

This is a re-release of an episode that first ran on April 9, 2017. In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems f...Show More
An Introduction to Stanley, the First Self-Driving Car

14:19 | Aug 5th

In October 2005, 23 cars lined up in the desert for a 140 mile race. Not one of those cars had a driver. This was the DARPA grand challenge to see if anyone could build an autonomous vehicle capable of navigating a desert route (and if so, whose ca...Show More
Putting the "science" in data science: the scientific method, the null hypothesis, and p-hacking

24:11 | Jul 29th

The modern scientific method is one of the greatest (perhaps the greatest?) systems we have for discovering knowledge about the world. It’s no surprise then that many data scientists have found their skills in high demand in the business world, where ...Show More
Interleaving

16:54 | Jul 22nd

If you’re Google or Netflix, and you have a recommendation or search system as part of your bread and butter, what’s the best way to test improvements to your algorithm? A/B testing is the canonical answer for testing how users respond to software ch...Show More
Federated Learning

15:03 | Jul 14th

This is a re-release of an episode first released in May 2017. As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that's being supplie...Show More
Endogenous Variables and Measuring Protest Effectiveness

17:58 | Jul 7th

This is a re-release of an episode first released in February 2017. Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers? It's a tricky question to answer, since usually we need ...Show More
Deepfakes

15:08 | Jul 1st

Generative adversarial networks (GANs) are producing some of the most realistic artificial videos we’ve ever seen. These videos are usually called “deepfakes”. Even to an experienced eye, it can be a challenge to distinguish a fabricated video from a...Show More
Revisiting Biased Word Embeddings

18:09 | Jun 24th

The topic of bias in word embeddings gets yet another pass this week. It all started a few years ago, when an analogy task performed on Word2Vec embeddings showed some indications of gender bias around professions (as well as other forms of social bi...Show More
Attention in Neural Nets

26:32 | Jun 17th

There’s been a lot of interest lately in the attention mechanism in neural nets—it’s got a colloquial name (who’s not familiar with the idea of “attention”?) but it’s more like a technical trick that’s been pivotal to some recent advances in computer...Show More
Interview with Joel Grus

39:46 | Jun 10th

This week’s episode is a special one, as we’re welcoming a guest: Joel Grus is a data scientist with a strong software engineering streak, and he does an impressive amount of speaking, writing, and podcasting as well. Whether you’re a new data scient...Show More
Re - Release: Factorization Machines

20:09 | Jun 3rd

What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.
Re-release: Auto-generating websites with deep learning

19:38 | May 27th

We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurrent neural nets that generate descriptions and caption...Show More
Advice to those trying to get a first job in data science

17:33 | May 19th

We often hear from folks wondering what advice we can give them as they search for their first job in data science. What does a hiring manager look for? Should someone focus on taking classes online, doing a bootcamp, reading books, something else? H...Show More
Re - Release: Machine Learning Technical Debt

22:29 | May 12th

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the idea of technical debt, which are inefficiencies that ...Show More
Estimating Software Projects, and Why It's Hard

19:07 | May 5th

If you’re like most software engineers and, especially, data scientists, you find it really hard to make accurate estimates of how long a project will take to complete. Don’t feel bad: statistics is most likely actively working against your best effo...Show More
The Black Hole Algorithm

20:17 | Apr 29th

53.5 million light-years away, there’s a gigantic galaxy called M87 with something interesting going on inside it. Between Einstein’s theory of relativity and the motion of a group of stars in the galaxy (the motion is characteristic of there being a...Show More
Structure in AI

19:05 | Apr 21st

As artificial intelligence algorithms get applied to more and more domains, a question that often arises is whether to somehow build structure into the algorithm itself to mimic the structure of the problem. There’s usually some amount of knowledge w...Show More
The Great Data Science Specialist vs. Generalist Debate

14:10 | Apr 15th

It’s not news that data scientists are expected to be capable in many different areas (writing software, designing experiments, analyzing data, talking to non-technical stakeholders). One thing that has been changing, though, as the field becomes a b...Show More
Google X, and Taking Risks the Smart Way

19:04 | Apr 8th

If you work in data science, you’re well aware of the sheer volume of high-risk, high-reward projects that are hypothetically possible. The fact that they’re high-reward means they’re exciting to think about, and the payoff would be huge if they succ...Show More
Statistical Significance in Hypothesis Testing

22:34 | Apr 1st

When you are running an AB test, one of the most important questions is how much data to collect. Collect too little, and you can end up drawing the wrong conclusion from your experiment. But in a world where experimenting is generally not free, and ...Show More
The Language Model Too Dangerous to Release

21:01 | Mar 25th

OpenAI recently created a cutting-edge new natural language processing model, but unlike all their other projects so far, they have not released it to the public. Why? It seems to be a little too good. It can answer reading comprehension questions, s...Show More
The cathedral and the bazaar

32:36 | Mar 17th

Imagine you have two choices of how to build something: top-down and controlled, with a few people playing a master designer role, or bottom-up and free-for-all, with nobody playing an explicit architect role. Which one do you think would make the be...Show More
AlphaStar

22:03 | Mar 11th

It’s time for our latest installation in the series on artificial intelligence agents beating humans at games that we thought were safe from the robots. In this case, the game is StarCraft, and the AI agent is AlphaStar, from the same team that built...Show More
Are machine learning engineers the new data scientists?

20:46 | Mar 4th

For many data scientists, maintaining models and workflows in production is both a huge part of their job and not something they necessarily trained for if their background is more in statistics or machine learning methodology. Productionizing and ma...Show More
Interview with Alex Radovic, particle physicist turned machine learning researcher

35:42 | Feb 25th

You’d be hard-pressed to find a field with bigger, richer, and more scientifically valuable data than particle physics. Years before “data scientist” was even a term, particle physicists were inventing technologies like the world wide web and cloud c...Show More
K Nearest Neighbors

16:25 | Feb 17th

K Nearest Neighbors is an algorithm with secrets. On one hand, the algorithm itself is as straightforward as possible: find the labeled points nearest the point that you need to predict, and make a prediction that’s the average of their answers. On t...Show More
Not every deep learning paper is great. Is that a problem?

17:54 | Feb 11th

Deep learning is a field that’s growing quickly. That’s good! There are lots of new deep learning papers put out every day. That’s good too… right? What if not every paper out there is particularly good? What even makes a paper good in the first plac...Show More
The Assumptions of Ordinary Least Squares

25:07 | Feb 3rd

Ordinary least squares (OLS) is often used synonymously with linear regression. If you’re a data scientist, machine learner, or statistician, you bump into it daily. If you haven’t had the opportunity to build up your understanding from the foundatio...Show More
Quantile Regression

21:46 | Jan 28th

Linear regression is a great tool if you want to make predictions about the mean value that an outcome will have given certain values for the inputs. But what if you want to predict the median? Or the 10th percentile? Or the 90th percentile? You need...Show More
Heterogeneous Treatment Effects

17:24 | Jan 20th

When data scientists use a linear regression to look for causal relationships between a treatment and an outcome, what they’re usually finding is the so-called average treatment effect. In other words, on average, here’s what the treatment does in te...Show More
Pre-training language models for natural language processing problems

27:35 | Jan 14th

When you build a model for natural language processing (NLP), such as a recurrent neural network, it helps a ton if you’re not starting from zero. In other words, if you can draw upon other datasets for building your understanding of word meanings, a...Show More
Facial Recognition, Society, and the Law

42:46 | Jan 7th

Facial recognition being used in everyday life seemed far-off not too long ago. Increasingly, it’s being used and advanced widely and with increasing speed, which means that our technical capabilities are starting to outpace (if they haven’t already)...Show More
Re-release: Word2Vec

17:59 | Dec 31st, 2018

Bringing you another old classic this week, as we gear up for 2019! See you next week with new content. Word2Vec is probably the go-to algorithm for vectorizing text data these days.  Which makes sense, because it is wicked cool.  Word2Vec has it ...Show More
Re - Release: The Cold Start Problem

15:37 | Dec 23rd, 2018

We’re taking a break for the holidays, chilling with the dog and an eggnog (Katie) and the cat and some spiced cider (Ben). Here’s an episode from a while back for you to enjoy. See you again in 2019! You might sometimes find that it's hard to get...Show More
Convex (and non-convex) Optimization

20:00 | Dec 17th, 2018

Convex optimization is one of the keys to data science, both because some problems straight-up call for optimization solutions and because popular algorithms like a gradient descent solution to ordinary least squares are supported by optimization tec...Show More
The Normal Distribution and the Central Limit Theorem

27:11 | Dec 9th, 2018

When you think about it, it’s pretty amazing that we can draw conclusions about huge populations, even the whole world, based on datasets that are comparatively very small (a few thousand, or a few hundred, or even sometimes a few dozen). That’s the ...Show More
Software 2.0

17:22 | Dec 2nd, 2018

Neural nets are a way you can model a system, sure, but if you take a step back, squint, and tilt your head, they can also be called… software? Not in the sense that they’re written in code, but in the sense that the neural net itself operates under ...Show More
Limitations of Deep Nets for Computer Vision

27:20 | Nov 18th, 2018

Deep neural nets have a deserved reputation as the best-in-breed solution for computer vision problems. But there are many aspects of human vision that we take for granted but where neural nets struggle—this episode covers an eye-opening paper that s...Show More
Building Data Science Teams

25:09 | Nov 12th, 2018

At many places, data scientists don’t work solo anymore—it’s a team sport. But data science teams aren’t simply teams of data scientists working together. Instead, they’re usually cross-functional teams with engineers, managers, data scientists, and ...Show More
Optimized Optimized Web Crawling

19:42 | Nov 4th, 2018

Last week’s episode, about methods for optimized web crawling logic, left off on a bit of a cliffhanger: the data scientists had found a solution to the problem, but it wasn’t something that the engineers (who own the search codebase, remember) liked...Show More
Optimized Web Crawling

21:32 | Oct 28th, 2018

Got a fun optimization problem for you this week! It’s a two-for-one: how do you optimize the web crawling logic of an operation like Google search so that the results are, on average, as up-to-date as possible, and how do you optimize your solution ...Show More
Better Know a Distribution: The Poisson Distribution

31:51 | Oct 22nd, 2018

The Poisson distribution is a probability distribution function used for events that happen in time or space. It’s super handy because it’s pretty simple to use and is applicable for tons of things—there are a lot of interesting processes that boi...Show More
Searching for Datasets with Google

19:54 | Oct 15th, 2018

If you wanted to find a dataset of jokes, how would you do it? What about a dataset of podcast episodes? If your answer was “I’d try Google,” you might have been disappointed—Google is a great search engine for many types of web data, but it didn’t h...Show More
It's our fourth birthday

22:06 | Oct 8th, 2018

We started Linear Digressions 4 years ago… this isn’t a technical episode, just two buddies shooting the breeze about something we’ve somehow built together.
Gigantic Searches in Particle Physics

24:46 | Sep 30th, 2018

This week, we’re dusting off the ol’ particle physics PhD to bring you an episode about ambitious new model-agnostic searches for new particles happening at CERN. Traditionally, new particles have been discovered by “targeted searches,” where scienti...Show More
Data Engineering

16:22 | Sep 24th, 2018

If you’re a data scientist, you know how important it is to keep your data orderly, clean, moving smoothly between different systems, well-documented… there’s a ton of work that goes into building and maintaining databases and data pipelines. This jo...Show More
Text Analysis for Guessing the NYTimes Op-Ed Author

18:37 | Sep 16th, 2018

A very intriguing op-ed was published in the NY Times recently, in which the author (a senior official in the Trump White House) claimed to be a minor saboteur of sorts, acting with his or her colleagues to undermine some of Donald Trump’s worst inst...Show More
The Three Types of Data Scientists, and What They Actually Do

23:25 | Sep 9th, 2018

If you've been in data science for more than a year or two, chances are you've noticed changes in the field as it's grown and matured. And if you're newer to the field, you may feel like there's a disconnect between lots of different stories about wh...Show More
Agile Development for Data Scientists, Part 2: Where Modifications Help

27:17 | Aug 26th, 2018

There's just too much interesting stuff at the intersection of agile software development and data science for us to be able to cover it all in one episode, so this week we're picking up where we left off last time. We'll give a quick overview of agi...Show More
Agile Development for Data Scientists, Part 1: The Good

25:56 | Aug 19th, 2018

If you're a data scientist at a firm that does a lot of software building, chances are good that you've seen or heard engineers sometimes talking about "agile software development." If you don't work at a software firm, agile practices might be newer...Show More
Re - Release: How To Lose At Kaggle

17:54 | Aug 13th, 2018

We've got a classic for you this week as we take a week off for the dog days of summer. See you again next week! Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the ve...Show More
Troubling Trends In Machine Learning Scholarship

29:35 | Aug 6th, 2018

There are a lot of great machine learning papers coming out every day--and, if we're being honest, some papers that are not as great as we'd wish. In some ways this is symptomatic of a field that's growing really quickly, but it's also an artifact of s...Show More
Can Fancy Running Shoes Cause You To Run Faster?

28:37 | Jul 29th, 2018

The stars aligned for me (Katie) this past weekend: I raced my first half-marathon in a long time and got to read a great article from the NY Times about a new running shoe that Nike claims can make its wearers run faster. Causal claims like this one...Show More
Compliance Bias

23:28 | Jul 22nd, 2018

When you're using an AB test to understand the effect of a treatment, there are a lot of assumptions about how the treatment (and control, for that matter) get applied. For example, it's easy to think that everyone who was assigned to the treatment a...Show More
AI Winter

19:02 | Jul 15th, 2018

Artificial Intelligence has been widely lauded as a solution to almost any problem. But as we juxtapose the hype in the field against the real-world benefits we see, it raises the question: Are we coming up on an AI winter?
Rerelease: How to Find New Things to Learn

18:32 | Jul 8th, 2018

We like learning on vacation. And we're on vacation, so we thought we'd re-air this episode about how to learn. Original Episode: https://lineardigressions.com/episodes/2017/5/14/how-to-find-new-things-to-learn Original Summary: If you're anyth...Show More
Rerelease: Space Codes

24:30 | Jul 2nd, 2018

We're on vacation on Mars, so we won't be communicating with you all directly this week. Though, if we wanted to, we could probably use this episode to help get started. Original Episode: http://lineardigressions.com/episodes/2017/3/19/space-codes...Show More
Rerelease: Anscombe's Quartet

16:14 | Jun 25th, 2018

We're on vacation, so we hope you enjoy this episode while we each sip cocktails on the beach. Original Episode: http://lineardigressions.com/episodes/2017/6/18/anscombes-quartet Original Summary: Anscombe's Quartet is a set of four datasets th...Show More
Rerelease: Hurricanes Produced

28:12 | Jun 18th, 2018

Now that hurricane season is upon us again (and we are on vacation), we thought a look back on our hurricane forecasting episode was prudent. Stay safe out there.
GDPR

18:24 | Jun 11th, 2018

By now, you have probably heard of GDPR, the EU's new data privacy law. It's the reason you've been getting so many emails about everyone's updated privacy policy. In this episode, we talk about some of the potential ramifications of GDPR in the w...Show More
Git for Data Scientists

22:05 | Jun 3rd, 2018

If you're a data scientist, chances are good that you've heard of git, which is a system for version controlling code. Chances are also good that you're not quite as up on git as you want to be--git has a strong following among software engineers but...Show More
Analytics Maturity

19:32 | May 20th, 2018

Data science and analytics are hot topics in business these days, but for a lot of folks looking to bring data into their organization, it can be hard to know where to start and what it looks like when they're succeeding. That was the motivation for ...Show More
SHAP: Shapley Values in Machine Learning

19:12 | May 13th, 2018

Shapley values in machine learning are an interesting and useful enough innovation that we figured hey, why not do a two-parter? Our last episode focused on explaining what Shapley values are: they define a way of assigning credit for outcomes across...Show More
Game Theory for Model Interpretability: Shapley Values

27:06 | May 7th, 2018

As machine learning models get into the hands of more and more users, there's an increasing expectation that black box isn't good enough: users want to understand why the model made a given prediction, not just what the prediction itself is. This is ...Show More
AutoML

15:24 | Apr 30th, 2018

If you were a machine learning researcher or data scientist ten years ago, you might have spent a lot of time implementing individual algorithms like decision trees and neural networks by hand. If you were doing that work five years ago, the algorith...Show More
CPUs, GPUs, TPUs: Hardware for Deep Learning

12:40 | Apr 23rd, 2018

A huge part of the ascent of deep learning in the last few years is related to advances in computer hardware that makes it possible to do the computational heavy lifting required to build models with thousands or even millions of tunable parameters. ...Show More
A Technical Introduction to Capsule Networks

31:28 | Apr 16th, 2018

Last episode we talked conceptually about capsule networks, the latest and greatest computer vision innovation to come out of Geoff Hinton's lab. This week we're getting a little more into the technical details, for those of you ready to have your mi...Show More
A Conceptual Introduction to Capsule Networks

14:05 | Apr 9th, 2018

Convolutional nets are great for image classification... if this were 2016. But it's 2018 and Canada's greatest neural networker Geoff Hinton has some new ideas, namely capsule networks. Capsule nets are a completely new type of neural net architectu...Show More
Google Flu Trends

12:46 | Mar 26th, 2018

It's been a nasty flu season this year. So we were remembering a story from a few years back (but not covered yet on this podcast) about when Google tried to predict flu outbreaks faster than the Centers for Disease Control by monitoring searches and...Show More
How to pick projects for a professional data science team

31:17 | Mar 19th, 2018

This week's episode is for data scientists, sure, but also for data science managers and executives at companies with data science teams. These folks all think very differently about the same question: what should a data science team be working on? ...Show More
Autoencoders

12:41 | Mar 12th, 2018

Autoencoders are neural nets that are optimized for creating outputs that... look like the inputs to the network. Turns out this is a not-too-shabby way to do unsupervised machine learning with neural nets.
When Private Data Isn't Private Anymore

26:20 | Mar 5th, 2018

After all the back-patting around making data science datasets and code more openly available, we figured it was time to also dump a bucket of cold water on everyone's heads and talk about the things that can go wrong when data and code is a little t...Show More
What makes a machine learning algorithm "superhuman"?

34:48 | Feb 26th, 2018

A few weeks ago, we podcasted about a neural network that was being touted as "better than doctors" in diagnosing pneumonia from chest x-rays, and how the underlying dataset used to train the algorithm raised some serious questions. We're back again ...Show More
Open Data and Open Science

16:54 | Feb 19th, 2018

One interesting trend we've noted recently is the proliferation of papers, articles and blog posts about data science that don't just tell the result--they include data and code that allow anyone to repeat the analysis. It's far from universal (for a...Show More
Defining the quality of a machine learning production system

20:29 | Feb 12th, 2018

Building a machine learning system and maintaining it in production are two very different things. Some folks over at Google wrote a paper that shares their thoughts around all the items you might want to test or check for your production ML system. ...Show More
Auto-generating websites with deep learning

19:24 | Feb 4th, 2018

We've already talked about neural nets in some detail (links below), and in particular we've been blown away by the way that image recognition from convolutional neural nets can be fed into recurrent neural nets that generate descriptions and caption...Show More
The Case for Learned Index Structures, Part 2: Hash Maps and Bloom Filters

20:41 | Jan 29th, 2018

Last week we started the story of how you could use a machine learning model in place of a data structure, and this week we wrap up with an exploration of Bloom Filters and Hash Maps. Just like last week, when we covered B-trees, we'll walk through b...Show More
The Case for Learned Index Structures, Part 1: B-Trees

18:50 | Jan 22nd, 2018

Jeff Dean and his collaborators at Google are turning the machine learning world upside down (again) with a recent paper about how machine learning models can be used as surprisingly effective substitutes for classic data structures. In this first pa...Show More
Challenges with Using Machine Learning to Classify Chest X-Rays

18:00 | Jan 15th, 2018

Another installment in our "machine learning might not be a silver bullet for solving medical problems" series. This week, we have a high-profile blog post that has been making the rounds for the last few weeks, in which a neural network trained to v...Show More
The Fourier Transform

15:39 | Jan 8th, 2018

The Fourier transform is one of the handiest tools in signal processing for dealing with periodic time series data. Using a Fourier transform, you can break apart a complex periodic function into a bunch of sine and cosine waves, and figure out what ...Show More
Statistics of Beer

15:20 | Jan 2nd, 2018

What better way to kick off a new year than with an episode on the statistics of brewing beer?
Re - Release: Random Kanye

09:33 | Dec 24th, 2017

We have a throwback episode for you today as we take the week off to enjoy the holidays. This week: what happens when you have a markov chain that generates mashup Kanye West lyrics with Bible verses? Exactly what you think.
Debiasing Word Embeddings

18:20 | Dec 18th, 2017

When we covered the Word2Vec algorithm for embedding words, we mentioned parenthetically that the word embeddings it produces can sometimes be a little bit less than ideal--in particular, gender bias from our society can creep into the embeddings and...Show More
The Kernel Trick and Support Vector Machines

17:48 | Dec 11th, 2017

Picking up after last week's episode about maximal margin classifiers, this week we'll go into the kernel trick and how that (combined with maximal margin algorithms) gives us the much-vaunted support vector machine.
Maximal Margin Classifiers

14:21 | Dec 4th, 2017

Maximal margin classifiers are a way of thinking about supervised learning entirely in terms of the decision boundary between two classes, and defining that boundary in a way that maximizes the distance from any given point to the boundary. It's a n...Show More
Re - Release: The Cocktail Party Problem

13:43 | Nov 27th, 2017

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!
Clustering with DBSCAN

16:14 | Nov 20th, 2017

DBSCAN is a density-based clustering algorithm for doing unsupervised learning. It's pretty nifty: with just two parameters, you can specify "dense" regions in your data, and grow those regions out organically to find clusters. In particular, it ca...Show More
The Kaggle Survey on Data Science

25:20 | Nov 13th, 2017

Want to know what's going on in data science these days?  There's no better way than to analyze a survey with over 16,000 responses that was recently released by Kaggle.  Kaggle asked practicing and aspiring data scientists about themselves, their tools,...Show More
Machine Learning: The High Interest Credit Card of Technical Debt

22:18 | Nov 6th, 2017

This week, we've got a fun paper by our friends at Google about the hidden costs of maintaining machine learning workflows. If you've worked in software before, you're probably familiar with the idea of technical debt, which are inefficiencies that ...Show More
Improving Upon a First-Draft Data Science Analysis

15:01 | Oct 30th, 2017

There are a lot of good resources out there for getting started with data science and machine learning, where you can walk through starting with a dataset and ending up with a model and set of predictions. Think something like the homework for your ...Show More
Survey Raking

17:23 | Oct 23rd, 2017

It's quite common for survey respondents not to be representative of the larger population from which they are drawn. But if you're a researcher, you need to study the larger population using data from your survey respondents, so what should you do?...Show More
Happy Hacktoberfest

15:40 | Oct 16th, 2017

It's the middle of October, so you've already made two pull requests to open source repos, right? If you have no idea what we're talking about, spend the next 20 minutes or so with us talking about the importance of open source software and how you c...Show More
Re - Release: Kalman Runners

17:53 | Oct 9th, 2017

In honor of the Chicago marathon this weekend (and due in large part to Katie recovering from running in it...) we have a re-release of an episode about Kalman filters, which is part algorithm part elaborate metaphor for figuring out, if you're runni...Show More
Neural Net Dropout

18:53 | Oct 2nd, 2017

Neural networks are complex models with many parameters and can be prone to overfitting.  There's a surprisingly simple way to guard against this: randomly destroy connections between hidden units, also known as dropout.  It seems counterintuitive th...Show More
Disciplined Data Science

29:34 | Sep 25th, 2017

As data science matures as a field, it's becoming clearer what attributes a data science team needs to have to elevate their work to the next level. Most of our episodes are about the cool work being done by other people, but this one summarizes som...Show More
Hurricane Forecasting

27:57 | Sep 18th, 2017

It's been a busy hurricane season in the Southeastern United States, with millions of people making life-or-death decisions based on the forecasts around where the hurricanes will hit and with what intensity. In this episode we'll deconstruct those ...Show More
Finding Spy Planes with Machine Learning

18:09 | Sep 11th, 2017

There are law enforcement surveillance aircraft circling over the United States every day, and in this episode, we'll talk about how some folks at BuzzFeed used public data and machine learning to find them.  The fun thing here, in our opinion, is th...Show More
Data Provenance

22:48 | Sep 4th, 2017

Software engineers are familiar with the idea of versioning code, so you can go back later and revive a past state of the system.  For data scientists who might want to reconstruct past models, though, it's not just about keeping the modeling code.  ...Show More
Adversarial Examples

16:11 | Aug 28th, 2017

Even as we rely more and more on machine learning algorithms to help with everyday decision-making, we're learning more and more about how they're frighteningly easy to fool sometimes. Today we have a roundup of a few successful efforts to create ro...Show More
Jupyter Notebooks

15:50 | Aug 21st, 2017

This week's episode is just in time for JupyterCon in NYC, August 22-25... Jupyter notebooks are probably familiar to a lot of data nerds out there as a great open-source tool for exploring data, doing quick visualizations, and packaging code snip...Show More
Curing Cancer with Machine Learning is Super Hard

19:20 | Aug 14th, 2017

Today, a dispatch on what can go wrong when machine learning hype outpaces reality: a high-profile partnership between IBM Watson and MD Anderson Cancer Center has recently hit the rocks as it turns out to be tougher than expected to cure cancer with...Show More
KL Divergence

25:38 | Aug 7th, 2017

Kullback Leibler divergence, or KL divergence, is a measure of information loss when you try to approximate one distribution with another distribution.  It comes to us originally from information theory, but today underpins other, more machine-learni...Show More
Sabermetrics

25:48 | Jul 31st, 2017

It's moneyball time! SABR (the Society for American Baseball Research) is the world's largest organization of statistics-minded baseball enthusiasts, who are constantly applying the craft of scientific analysis to trying to figure out who are the be...Show More
What Data Scientists Can Learn from Software Engineers

23:46 | Jul 24th, 2017

We're back again with friend of the pod Walt, former software engineer extraordinaire and current data scientist extraordinaire, to talk about some best practices from software engineering that are ready to jump the fence over to data science.  If la...Show More
Software Engineering to Data Science

19:05 | Jul 17th, 2017

Data scientists and software engineers often work side by side, building out and scaling technical products and services that are data-heavy but also require a lot of software engineering to build and maintain.  In this episode, we'll chat with a Fri...Show More
Re-Release: Fighting Cholera with Data, 1854

12:04 | Jul 10th, 2017

This episode was first released in November 2014. In the 1850s, there were a lot of things we didn’t know yet: how to create an airplane, how to split an atom, or how to control the spread of a common but deadly disease: cholera. When a cholera...Show More
Re-Release: Data Mining Enron

32:16 | Jul 2nd, 2017

This episode was first released in February 2015. In 2000, Enron was one of the largest companies in the world, praised far and wide for its innovations in energy distribution and many other markets. By 2002, it was apparent that many bad apple...Show More
Factorization Machines

19:54 | Jun 26th, 2017

What do you get when you cross a support vector machine with matrix factorization? You get a factorization machine, and a darn fine algorithm for recommendation engines.
Anscombe's Quartet

15:39 | Jun 19th, 2017

Anscombe's Quartet is a set of four datasets that have the same mean, variance and correlation but look very different. It's easy to think that having a good set of summary statistics (like mean, variance and correlation) can tell you everything imp...Show More
Traffic Metering Algorithms

18:34 | Jun 12th, 2017

Originally released June 2016. This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't...Show More
Page Rank

19:58 | Jun 5th, 2017

The year: 1998.  The size of the web: 150 million pages.  The problem: information retrieval.  How do you find the "best" web pages to return in response to a query?  A graduate student named Larry Page had an idea for how it could be done better and...Show More
Fractional Dimensions

20:28 | May 29th, 2017

We chat about fractional dimensions, and what the actual heck those are.
Things You Learn When Building Models for Big Data

21:39 | May 22nd, 2017

As more and more data gets collected seemingly every day, and data scientists use that data for modeling, the technical limits associated with machine learning on big datasets keep getting pushed back.  This week is a first-hand case study in using s...Show More
How to Find New Things to Learn

17:54 | May 15th, 2017

If you're anything like us, you a) always are curious to learn more about data science and machine learning and stuff, and b) are usually overwhelmed by how much content is out there (not all of it very digestible). We hope this podcast is a part of...Show More
Federated Learning

14:03 | May 8th, 2017

As machine learning makes its way into more and more mobile devices, an interesting question presents itself: how can we have an algorithm learn from training data that's being supplied as users interact with the algorithm? In other words, how do we...Show More
Word2Vec

17:59 | May 1st, 2017

Word2Vec is probably the go-to algorithm for vectorizing text data these days.  Which makes sense, because it is wicked cool.  Word2Vec has it all: neural networks, skip-grams and bag-of-words implementations, a multiclass classifier that gets swappe...Show More
Feature Processing for Text Analytics

17:28 | Apr 24th, 2017

It seems like every day there are more and more machine learning problems that involve learning on text data, but text itself makes for fairly lousy inputs to machine learning algorithms.  That's why there are text vectorization algorithms, which re-fo...Show More
Education Analytics

21:05 | Apr 17th, 2017

This week we'll hop into the rapidly developing industry around predictive analytics for education. For many of the students who eventually drop out, data science is showing that there might be early warning signs that the student is in trouble--we'...Show More
A Technical Deep Dive on Stanley, the First Self-Driving Car

40:42 | Apr 10th, 2017

In our follow-up episode to last week's introduction to the first self-driving car, we will be doing a technical deep dive this week and talking about the most important systems for getting a car to drive itself 140 miles across the desert.  Lidar?  ...Show More
An Introduction to Stanley, the First Self-Driving Car

13:07 | Apr 3rd, 2017

In October 2005, 23 cars lined up in the desert for a 140 mile race.  Not one of those cars had a driver.  This was the DARPA grand challenge to see if anyone could build an autonomous vehicle capable of navigating a desert route (and if so, whose ca...Show More
Feature Importance

20:15 | Mar 27th, 2017

Figuring out what features actually matter in a model is harder than you might first guess.  When a human makes a decision, you can just ask them--why did you do that?  But with machine learning models, not so much.  That's why we wante...Show More
Space Codes!

23:56 | Mar 20th, 2017

It's hard to get information to and from Mars.  Mars is very far away, and expensive to get to, and the bandwidth for passing messages with Earth is not huge.  The messages you do pass have to traverse millions of miles, which provides ample opportun...Show More
Finding (and Studying) Wikipedia Trolls

15:50 | Mar 13th, 2017

You may be shocked to hear this, but sometimes, people on the internet can be mean.  For some of us this is just a minor annoyance, but if you're a maintainer or contributor of a large project like Wikipedia, abusive users can be a huge problem.  Fig...Show More
A Sprint Through What's New in Neural Networks

16:56 | Mar 6th, 2017

Advances in neural networks are moving fast enough that, even though it seems like we talk about them all the time around here, it also always seems like we're barely keeping up.  So this week we have another installment in our "neural nets: they so ...Show More
Stein's Paradox

27:02 | Feb 27th, 2017

When you're estimating something about some object that's a member of a larger group of similar objects (say, the batting average of a baseball player, who belongs to a baseball team), how should you estimate it: use measurements of the individual, o...Show More
Empirical Bayes

18:57 | Feb 20th, 2017

Say you're looking to use some Bayesian methods to estimate parameters of a system. You've got the normalization figured out, and the likelihood, but the prior... what should you use for a prior? Empirical Bayes has an elegant answer: look to your ...Show More
Endogenous Variables and Measuring Protest Effectiveness

16:28 | Feb 13th, 2017

Have you been out protesting lately, or watching the protests, and wondered how much effect they might have on lawmakers? It's a tricky question to answer, since usually we need randomly distributed treatments (e.g. big protests) to understand causa...Show More
Calibrated Models

14:32 | Feb 6th, 2017

Remember last week, when we were talking about how great the ROC curve is for evaluating models? How things change... This week, we're exploring calibrated risk models, because that's a kind of model that seems like it would benefit from some nice ...Show More
Rock the ROC Curve

15:52 | Jan 30th, 2017

This week: everybody's favorite WWII-era classifier metric! But it's not just for winning wars, it's a fantastic go-to metric for all your classifier quality needs.
Ensemble Algorithms

13:08 | Jan 23rd, 2017

If one machine learning model is good, are two models better? In a lot of cases, the answer is yes. If you build many ok models, and then bring them all together and use them in combination to make your final predictions, you've just created an ens...Show More
How to evaluate a translation: BLEU scores

17:06 | Jan 16th, 2017

As anyone who's encountered a badly translated text could tell you, not all translations are created equal. Some translations are smooth, fluent and sound like a poet wrote them; some are jerky, non-grammatical and awkward. When a machine is doing ...Show More
Zero Shot Translation

25:32 | Jan 9th, 2017

Take Google-size data, the flexibility of a neural net, and all (well, most) of the languages of the world, and what you end up with is a pile of surprises. This episode is about some interesting features of Google's new neural machine translation s...Show More
Google Neural Machine Translation

18:12 | Jan 2nd, 2017

Recently, Google swapped out the backend for Google Translate, moving from a statistical phrase-based method to a recurrent neural network. This marks a big change in methodology: the tried-and-true statistical translation methods that have been in ...Show More
Data and the Future of Medicine: Interview with Precision Medicine Initiative researcher Matt Might

34:54 | Dec 26th, 2016

Today we are delighted to bring you an interview with Matt Might, computer scientist and medical researcher extraordinaire and architect of President Obama's Precision Medicine Initiative. As the Obama Administration winds down, we're talking with M...Show More
Special Crossover Episode: Partially Derivative interview with White House Data Scientist DJ Patil

46:09 | Dec 18th, 2016

We have the pleasure of bringing you a very special crossover episode this week: our friends at Partially Derivative (another great podcast about data science, you should check it out) recently interviewed White House Chief Data Scientist DJ Patil. ...Show More
How to Lose at Kaggle

17:16 | Dec 12th, 2016

Competing in a machine learning competition on Kaggle is a kind of rite of passage for data scientists. Losing unexpectedly at the very end of the contest is also something that a lot of us have experienced. It's not just bad luck: a very specific ...Show More
Attacking Discrimination in Machine Learning

23:20 | Dec 5th, 2016

Imagine there's an important decision to be made about someone, like a bank deciding whether to extend a loan, or a school deciding to admit a student--unfortunately, we're all too aware that discrimination can sneak into these situations (even when ...Show More
Recurrent Neural Nets

12:36 | Nov 28th, 2016

This week, we're doing a crash course in recurrent neural networks--what the structural pieces are that make a neural net recurrent, how that structure helps RNNs solve certain time series problems, and the importance of forgetfulness in RNNs. R...Show More
Stealing a PIN with signal processing and machine learning

16:55 | Nov 21st, 2016

Want another reason to be paranoid when using the free coffee shop wifi? Allow us to introduce WindTalker, a system that cleverly combines a dose of signal processing with a dash of machine learning to (potentially) steal the PIN from your phone tra...Show More
Neural Net Cryptography

16:16 | Nov 14th, 2016

Cryptography used to be the domain of information theorists and spies. There's a new player now: neural networks. Given the task of communicating securely, neural networks are inventing new encryption methods that, as best we can tell, are unlike a...Show More
Deep Blue

20:05 | Nov 7th, 2016

In 1997, Deep Blue was the IBM algorithm/computer that did what no one, at the time, thought possible: it beat the world's best chess player.  It turns out, though, that one of the most important moves in the matchup, where Deep Blue psyched out its o...Show More
Organizing Google's Datasets

15:00 | Oct 31st, 2016

If you're a data scientist, there's a good chance you're used to working with a lot of data. But there's a lot of data, and then there's Google-scale amounts of data. Keeping all that data organized is a Google-sized task, and as it happens, they'v...Show More
Fighting Cancer with Data Science: Followup

25:48 | Oct 24th, 2016

A few months ago, Katie started on a project for the Vice President's Cancer Moonshot surrounding how data can be used to better fight cancer. The project is all wrapped up now, so we wanted to tell you about how that work went and what changes to c...Show More
The 19-year-old determining the US election

12:28 | Oct 17th, 2016

Sick of the presidential election yet? We are too, but there's still almost a month to go, so let's just embrace it together. This week, we'll talk about one of the presidential polls, which has been kind of an outlier for quite a while. This week...Show More
How to Steal a Model

13:36 | Oct 9th, 2016

What does it mean to steal a model? It means someone (the thief, presumably) can re-create the predictions of the model without having access to the algorithm itself, or the training data. Sound far-fetched? It isn't. If that person can ask for p...Show More
Regularization

17:27 | Oct 3rd, 2016

Lots of data is usually seen as a good thing. And it is a good thing--except when it's not. In a lot of fields, a problem arises when you have many, many features, especially if there's a somewhat smaller number of cases to learn from; supervised m...Show More
The Cold Start Problem

15:37 | Sep 26th, 2016

You might sometimes find that it's hard to get started doing something, but once you're going, it gets easier. Turns out machine learning algorithms, and especially recommendation engines, feel the same way. The more they "know" about a user, like ...Show More
Open Source Software for Data Science

20:05 | Sep 19th, 2016

If you work in tech, software or data science, there's an excellent chance you use tools that are built upon open source software. This is software that's built and distributed not for a profit, but because everyone benefits when we work together an...Show More
Scikit + Optimization = Scikit-Optimize

15:41 | Sep 12th, 2016

We're excited to welcome a guest, Tim Head, who is one of the maintainers of the scikit-optimize package. With all the talk about optimization lately, it felt appropriate to get in a few words with someone who's out there making it happen for python...Show More
Two Cultures: Machine Learning and Statistics

17:29 | Sep 5th, 2016

It's a funny thing to realize, but data science modeling is usually about either explainability, interpretation and understanding, or it's about predictive accuracy. But usually not both--optimizing for one tends to compromise the other. Leo Breim...Show More
Optimization Solutions

20:07 | Aug 29th, 2016

You've got an optimization problem to solve, and a less-than-forever amount of time in which to solve it. What do? Use a heuristic optimization algorithm, like a hill climber or simulated annealing--we cover both in this episode! Relevant link: ...Show More
Optimization Problems

17:50 | Aug 22nd, 2016

If modeling is about predicting the unknown, optimization tries to answer the question of what to do, what decision to make, to get the best results out of a given situation. Sometimes that's straightforward, but sometimes... not so much. What make...Show More
Multi-level modeling for understanding DEADLY RADIOACTIVE GAS

23:34 | Aug 15th, 2016

Ok, this episode is only sort of about DEADLY RADIOACTIVE GAS. It's mostly about multilevel modeling, which is a way of building models with data that has distinct, related subgroups within it. What are multilevel models used for? Elections (we ca...Show More
How Polls Got Brexit "Wrong"

15:14 | Aug 8th, 2016

Continuing the discussion of how polls do (and sometimes don't) tell us what to expect in upcoming elections--let's take a concrete example from the recent past, shall we? The Brexit referendum was, by and large, expected to shake out for "remain", ...Show More
Election Forecasting

28:59 | Aug 1st, 2016

Not sure if you heard, but there's an election going on right now.  Polls, surveys, and projections abound, as far as the eye can see.  How to make sense of it all?  How are the projections made?  Which are some good ones to follow?  We'll be your tru...Show More
Machine Learning for Genomics

20:22 | Jul 25th, 2016

Genomics data is some of the biggest #bigdata, and doing machine learning on it is unlocking new ways of thinking about evolution, genomic diseases like cancer, and what really makes each of us different from everyone else.  This episode touches on so...Show More
Climate Modeling

19:49 | Jul 18th, 2016

Hot enough for you? Climate models suggest that it's only going to get warmer in the coming years. This episode unpacks those models, so you understand how they work. A lot of the episodes we do are about fun studies we hear about, like "if yo...Show More
Reinforcement Learning Gone Wrong

28:16 | Jul 11th, 2016

Last week’s episode on artificial intelligence gets a huge payoff this week—we’ll explore a wonderful couple of papers about all the ways that artificial intelligence can go wrong. Malevolent actors? You bet. Collateral damage? Of course. Reward...Show More
Reinforcement Learning for Artificial Intelligence

18:30 | Jul 3rd, 2016

There’s a ton of excitement about reinforcement learning, a form of semi-supervised machine learning that underpins a lot of today’s cutting-edge artificial intelligence algorithms. Here’s a crash course in the algorithmic machinery behind AlphaGo, ...Show More
Differential Privacy: how to study people without being weird and gross

18:17 | Jun 27th, 2016

Apple wants to study iPhone users' activities and use it to improve performance. Google collects data on what people are doing online to try to improve their Chrome browser. Do you like the idea of this data being collected? Maybe not, if it's bei...Show More
How the sausage gets made

29:13 | Jun 20th, 2016

Something a little different in this episode--we'll be talking about the technical plumbing that gets our podcast from our brains to your ears. As it turns out, it's a multi-step bucket brigade process of RSS feeds, links to downloads, and lots of h...Show More
SMOTE: makin' yourself some fake minority data

14:37 | Jun 13th, 2016

Machine learning on imbalanced classes: surprisingly tricky. Many (most?) algorithms tend to just assign the majority class label to all the data and call it a day. SMOTE is an algorithm for manufacturing new minority class examples for yourself, t...Show More
Conjoint Analysis: like AB testing, but on steroids

18:27 | Jun 6th, 2016

Conjoint analysis is like AB testing, but more bigger more better: instead of testing one or two things, you can test potentially dozens of options.  Where might you use something like this?  Well, if you wanted to design an entire hotel chain complet...Show More
Traffic Metering Algorithms

17:30 | May 30th, 2016

This episode is for all you (us) traffic nerds--we're talking about the hidden structure underlying traffic on-ramp metering systems. These systems slow down the flow of traffic onto highways so that the highways don't get overloaded with cars and c...Show More
Um Detector 2: The Dynamic Time Warp

14:00 | May 23rd, 2016

One tricky thing about working with time series data, like the audio data in our "um" detector (remember that? because we barely do...), is that sometimes events look really similar but one is a little bit stretched and squeezed relative to the othe...Show More
Inside a Data Analysis: Fraud Hunting at Enron

30:28 | May 16th, 2016

It's storytime this week--the story, from beginning to end, of how Katie designed and built the main project for Udacity's Intro to Machine Learning class, when she was developing the course. The project was to use email and financial data to hunt f...Show More
What's the biggest #bigdata?

25:31 | May 9th, 2016

Data science is often mentioned in the same breath as big data.  But how big is big data?  And who has the biggest big data?  CERN?  Youtube?  ... Something (or someone) else?    Relevant link:  http://journals.plos.org/plosbiology/article?id=...Show More
Data Contamination

20:58 | May 2nd, 2016

Supervised machine learning assumes that the features and labels used for building a classifier are isolated from each other--basically, that you can't cheat by peeking. Turns out this can be easier said than done. In this episode, we'll talk about...Show More
Model Interpretation (and Trust Issues)

16:57 | Apr 25th, 2016

Machine learning algorithms can be black boxes--inputs go in, outputs come out, and what happens in the middle is anybody's guess. But understanding how a model arrives at an answer is critical for interpreting the model, and for knowing if it's doi...Show More
Updates! Political Science Fraud and AlphaGo

31:43 | Apr 18th, 2016

We've got updates for you about topics from past shows! First, the political science scandal of the year 2015 has a new chapter, we'll remind you about the original story and then dive into what has happened since. Then, we've got an update on Alph...Show More
Ecological Inference and Simpson's Paradox

18:32 | Apr 11th, 2016

Simpson's paradox is the data science equivalent of looking through one eye and seeing a very clear trend, and then looking through the other eye and seeing the very clear opposite trend. In one case, you see a trend one way in a group, but then bre...Show More
Discriminatory Algorithms

15:21 | Apr 4th, 2016

Sometimes when we say an algorithm discriminates, we mean it can tell the difference between two types of items. But in this episode, we'll talk about another, more troublesome side to discrimination: algorithms can be... racist? Sexist? Ageist? ...Show More
Recommendation Engines and Privacy

31:33 | Mar 28th, 2016

This episode started out as a discussion of recommendation engines, like Netflix uses to suggest movies. There's still a lot of that in here. But a related topic, which is both interesting and important, is how to keep data private in the era of la...Show More
Neural nets play cops and robbers (AKA generative adversarial networks)

18:56 | Mar 21st, 2016

One neural net is creating counterfeit bills and passing them off to a second neural net, which is trying to distinguish the real money from the fakes. Result: two neural nets that are better than either one would have been without the competition. ...Show More
A Data Scientist's View of the Fight against Cancer

19:08 | Mar 14th, 2016

In this episode, we're taking many episodes' worth of insights and unpacking an extremely complex and important question--in what ways are we winning the fight against cancer, where might that fight go in the coming decade, and how do we know when we...Show More
Congress Bots and DeepDrumpf

20:47 | Mar 11th, 2016

Hey, sick of the election yet? Fear not, there are algorithms that can automagically generate political-ish speech so that we never need to be without an endless supply of Congressional speeches and Donald Trump twitticisms! Relevant links: ht...Show More
Multi-Armed Bandits

11:29 | Mar 7th, 2016

Multi-armed bandits: how to take your randomized experiment and make it harder better faster stronger. Basically, a multi-armed bandit experiment allows you to optimize for both learning and making use of your knowledge at the same time. It's what ...Show More
Experiments and Messy, Tricky Causality

16:59 | Mar 4th, 2016

"People with a family history of heart disease are more likely to eat healthy foods, and have a high incidence of heart attacks." Did the healthy food cause the heart attacks? Probably not. But establishing causal links is extremely tricky, and ex...Show More
Backpropagation

12:21 | Feb 29th, 2016

The reason that neural nets are taking over the world right now is that they can be efficiently trained with the backpropagation algorithm.  In short, backprop allows you to adjust the weights of the neural net based on how good of a job the neura...Show More
Text Analysis on the State Of The Union

22:22 | Feb 26th, 2016

First up in this episode: a crash course in natural language processing, and important steps if you want to use machine learning techniques on text data. Then we'll take that NLP know-how and talk about a really cool analysis of State of the Union t...Show More
Paradigms in Artificial Intelligence

17:20 | Feb 22nd, 2016

Artificial intelligence includes a number of different strategies for how to make machines more intelligent, and often more human-like, in their ability to learn and solve problems. An ambitious group of researchers is working right now to classify ...Show More
Survival Analysis

15:21 | Feb 19th, 2016

Survival analysis is all about studying how long until an event occurs--it's used in marketing to study how long a customer stays with a service, in epidemiology to estimate the duration of survival of a patient with some illness, and in social scien...Show More
Gravitational Waves

20:26 | Feb 15th, 2016

All aboard the gravitational waves bandwagon--with the first direct observation of gravitational waves announced this week, Katie's dusting off her physics PhD for a very special gravity-related episode. Discussed in this episode: what are gravitati...Show More
The Turing Test

15:15 | Feb 12th, 2016

Let's imagine a future in which a truly intelligent computer program exists. How would it convince us (humanity) that it was intelligent? Alan Turing's answer to this question, proposed over 60 years ago, is that the program could convince a human ...Show More
Item Response Theory: how smart ARE you?

11:46 | Feb 8th, 2016

Psychometrics is all about measuring the psychological characteristics of people; for example, scholastic aptitude. How is this done? Tests, of course! But there's a chicken-and-egg problem here: you need to know both how hard a test is, and how s...Show More
Go!

19:59 | Feb 5th, 2016

As you may have heard, a computer beat a world-class human player in Go last week. As recently as a year ago the prediction was that it would take a decade to get to this point, yet here we are, in 2016. We'll talk about the history and strategy of...Show More
Great Social Networks in History

12:42 | Feb 1st, 2016

The Medici were one of the great ruling families of Europe during the Renaissance. How did they come to rule? Not power, or money, or armies, but through the strength of their social network. And speaking of great historical social networks, analy...Show More
How Much to Pay a Spy (and a lil' more auctions)

16:59 | Jan 29th, 2016

A few small encores on auction theory, and then--how can you value a piece of information before you know what it is? Decision theory has some pointers. Some highly relevant information if you are trying to figure out how much to pay a spy. Rele...Show More
Sold! Auctions (Part 2)

17:27 | Jan 25th, 2016

The Google ads auction is a special kind of auction, one you might not know as well as the famous English auction (which we talked about in the last episode). But if it's what Google uses to sell billions of dollars of ad space in real time, you kno...Show More
Going Once, Going Twice: Auctions (Part 1)

12:39 | Jan 22nd, 2016

The Google AdWords algorithm is (famously) an auction system for allocating a massive amount of online ad space in real time--with that fascinating use case in mind, this episode is part one in a two-part series all about auctions. We dive into the ...Show More
Chernoff Faces and Minard Maps

15:11 | Jan 18th, 2016

A data visualization extravaganza in this episode, as we discuss Chernoff faces (you: "faces? huh?" us: "oh just you wait") and the greatest data visualization of all time, or at least the Napoleonic era. Relevant links: http://lya.fciencias.unam...Show More
t-SNE: Reduce Your Dimensions, Keep Your Clusters

16:55 | Jan 15th, 2016

Ever tried to visualize a cluster of data points in 40 dimensions? Or even 4, for that matter? We prefer to stick to 2, or maybe 3 if we're feeling well-caffeinated. The t-SNE algorithm is one of the best tools on the market for doing dimensionali...Show More
The [Expletive Deleted] Problem

09:54 | Jan 11th, 2016

The town of [expletive deleted], England, is responsible for the clbuttic [expletive deleted] problem. This week on Linear Digressions: we try really hard not to swear too much. Related links: https://en.wikipedia.org/wiki/Scunthorpe_problem ht...Show More
Unlabeled Supervised Learning--whaaa?

12:35 | Jan 8th, 2016

In order to do supervised learning, you need a labeled training dataset. Or do you...? Relevant links: http://www.cs.columbia.edu/~dplewis/candidacy/goldman00enhancing.pdf
Hacking Neural Nets

15:28 | Jan 5th, 2016

Machine learning: it can be fooled, just like you or me. Here's one of our favorite examples, a study into hacking neural networks. Relevant links: http://arxiv.org/pdf/1412.1897v4.pdf
Zipf's Law

11:43 | Dec 31st, 2015

Zipf's law is related to the statistics of how word usage is distributed. As it turns out, this is also strikingly reminiscent of how income is distributed, and populations of cities, and bug reports in software, as well as tons of other phenomena t...Show More
Indie Announcement

01:19 | Dec 30th, 2015

We've gone indie! Which shouldn't change anything about the podcast that you know and love, but we're super excited to keep bringing you Linear Digressions as a fully independent podcast. Some links mentioned in the show: https://twitter.com/lin...Show More
Portrait Beauty

11:44 | Dec 27th, 2015

It's Da Vinci meets Skynet: what makes a portrait beautiful, according to a machine learning algorithm. Snap a selfie and give us a listen.
The Cocktail Party Problem

12:04 | Dec 18th, 2015

Grab a cocktail, put on your favorite karaoke track, and let’s talk some more about disentangling audio data!
A Criminally Short Introduction to Semi Supervised Learning

09:12 | Dec 4th, 2015

Because there are more interesting problems than there are labeled datasets, semi-supervised learning provides a framework for getting feedback from the environment as a proxy for labels of what's "correct." Of all the machine learning methodologies...Show More
Thresholdout: Down with Overfitting

15:52 | Nov 27th, 2015

Overfitting to your training data can be avoided by evaluating your machine learning algorithm on a holdout test dataset, but what about overfitting to the test data? Turns out it can be done, easily, and you have to be very careful to avoid it. Bu...Show More
The State of Data Science

15:40 | Nov 10th, 2015

How many data scientists are there, where do they live, where do they work, what kind of tools do they use, and how do they describe themselves? RJMetrics wanted to know the answers to these questions, so they decided to find out and share their ana...Show More
Data Science for Making the World a Better Place

09:31 | Nov 6th, 2015

There's a good chance that great data science is going on close to you, and that it's going toward making your city, state, country, and planet a better place. Not all the data science questions being tackled out there are about finding the sleekest...Show More
Kalman Runners

14:42 | Oct 29th, 2015

The Kalman Filter is an algorithm for taking noisy measurements of dynamic systems and using them to get a better idea of the underlying dynamics than you could get from a simple extrapolation. If you've ever run a marathon, or been a nuclear missil...Show More
Neural Net Inception

15:19 | Oct 23rd, 2015

When you sleep, the neural pathways in your brain take the "white noise" of your resting brain, mix in your experiences and imagination, and the result is dreams (that is a highly unscientific explanation, but you get the idea). What happens when ne...Show More
Benford's Law

17:42 | Oct 16th, 2015

Sometimes numbers are... weird. Benford's Law is a favorite example of this for us--it's a law that governs the distribution of the first digit in certain types of numbers. As it turns out, if you're looking up the length of a river, the population...Show More
Guinness

14:43 | Oct 7th, 2015

Not to oversell it, but Student's t-test has got to have the most interesting history of any statistical test. Which is saying a lot, right? Add some boozy statistical trivia to your arsenal in this episode.
PFun with P Values

17:07 | Sep 2nd, 2015

Doing some science, and want to know if you might have found something? Or maybe you've just accomplished the scientific equivalent of going fishing and reeling in an old boot? Frequentist p-values can help you distinguish between "eh" and "oooh in...Show More
Watson

15:36 | Aug 25th, 2015

This machine learning algorithm beat the human champions at Jeopardy. What is... Watson?
Bayesian Psychics

11:44 | Aug 18th, 2015

Come get a little "out there" with us this week, as we use a meta-study of extrasensory perception (or ESP, often used in the same sentence as "psychics") to chat about Bayesian vs. frequentist statistics.
Troll Detection

12:57 | Aug 7th, 2015

Ever found yourself wasting time reading online comments from trolls? Of course you have; we've all been there (it's 4 AM but I can't turn off the computer and go to sleep--someone on the internet is WRONG!). Now there's a way to use machine learni...Show More
Yiddish Translation

12:15 | Aug 3rd, 2015

Imagine a language that is mostly spoken rather than written, contains many words in other languages, and has relatively little written overlap with English. Now imagine writing a machine-learning-based translation system that can convert that langu...Show More
Modeling Particles in Atomic Bombs

15:38 | Jul 6th, 2015

In a fun historical journey, Katie and Ben explore the history of the Manhattan Project, discuss the difficulties in modeling particle movement in atomic bombs with only punch-card computers and ingenuity, and eventually come to present-day uses of t...Show More
Random Number Generation

10:26 | Jun 19th, 2015

Let's talk about randomness! Although randomness is pervasive throughout the natural world, it's surprisingly difficult to generate random numbers. And even if your numbers look random (but actually aren't), it can have interesting consequences on th...Show More
Electoral Insights (Part 2)

21:18 | Jun 9th, 2015

Following up on our last episode about how experiments can be performed in political science, now we explore a high-profile case of an experiment gone wrong. An extremely high-profile paper that was published in 2014, about how talking to people ...Show More
Electoral Insights (Part 1)

09:17 | Jun 5th, 2015

The first of our two-parter discussing the recent electoral data fraud case. The results of the study in question were covered widely, including by This American Life (who later had to issue a retraction). Data science for election research involv...Show More
Falsifying Data

17:46 | Jun 1st, 2015

In the first of a few episodes on fraud in election research, we’ll take a look at a case study from a previous Presidential election, where polling results were faked. What are some telltale signs that data fraud might be present in a dataset? ...Show More
Reporter Bot

11:15 | May 20th, 2015

There’s a big difference between a table of numbers or statistics, and the underlying story that a human might tell about how those numbers were generated. Think about a baseball game—the game stats and a newspaper story are describing the same t...Show More
Careers in Data Science

16:35 | May 16th, 2015

Let’s talk money. As a “hot” career right now, data science can pay pretty well. But for an individual person matched with a specific job or industry, how much should someone expect to make? Since Katie was on the job market lately, this was some...Show More
That's "Dr Katie" to You

03:01 | May 14th, 2015

Katie successfully defended her thesis! We celebrate her return, and talk a bit about what getting a PhD in Physics is like.
Neural Nets (Part 2)

10:55 | May 11th, 2015

In the last episode, we zipped through neural nets and got a quick idea of how they work and why they can be so powerful. Here’s the real payoff of that work: In this episode, we’ll talk about a brand-new pair of results, one from Stanford and one...Show More
Neural Nets (Part 1)

09:00 | May 1st, 2015

There is no known learning algorithm that is more flexible and powerful than the human brain. That's quite inspirational, if you think about it--to level up machine learning, maybe we should be going back to biology and letting millions of years of ev...Show More
Inferring Authorship (Part 2)

14:04 | Apr 28th, 2015

Now that we’re up to speed on the classic author ID problem (who wrote the unsigned Federalist Papers?), we move on to a couple more contemporary examples. First, J.K. Rowling was famously outed using computational linguistics (and Twitter) when s...Show More