Fixing Random, part 16

This series is getting quite long and we’re not done yet! This would be a good time to quickly review where we’re at:

We’re representing a particular discrete probability distribution P(A) over a small number of members of a particular type A by IDiscreteDistribution<A>.
We can condition a distribution — by discarding certain possibilities from it — with Where.
We can project a distribution from one type to another with Select.
A conditional probability P(B|A) — the probability of B given that some A is true — is represented as likelihood function of type Func<A, IDiscreteDistribution<B>>.
We can “bind” a likelihood function onto a prior distribution with SelectManyto produce a joint distribution.

These are all good results, and I hope you agree that we have already produced a much richer and more powerful abstraction over randomness than System.Random provides. But in today’s episode everything is really going to come together to reveal that we can use these tools to solve interesting problems in probabilistic inference.

To show how, we’ll need to start by reviewing Bayes’ Theorem.

If we have a prior P(A), and a likelihood P(B|A), we know that we can “bind” them together to form the joint distribution. That is, the probability of A and B both happening is the probability of A multiplied by the probability that B happens given that A has happened:

P(A&B) = P(A) P(B|A)

Obviously that goes the other way. If we have P(B) as our prior, and P(A|B) as our likelihood, then:

P(B&A) = P(B) P(A|B)

But A&B is the same as B&A, and things equal to the same are equal to each other. Therefore:

P(A) P(B|A) = P(B) P(A|B)

Let’s suppose that P(A) is our prior and P(B|A) is our likelihood. In the equation above the term P(A|B) is called the posterior and can be computed like this:

P(A|B) = P(A) P(B|A) / P(B)

I hope that is clear, but let’s move away from the abstract mathematics and illustrate an example by using the code we’ve written so far.

We can step back a few episodes and re-examine our prior and likelihood example for Frob Syndrome. Recall that this was a made-up study of a made-up condition which we believe may be linked to height. We’ll use the weights from the original episode.

That is to say: we have P(Height), we have likelihood function P(Severity|Height), and we wish to first compute the joint probability distribution P(Height&Severity):

This produces:

(Tall, Severe):850
(Tall, Moderate):935
(Medium, Moderate):504
(Medium, Mild):210
(Short, Mild):357

Now the question is: what is the posterior, P(Height|Severity)? Remember what this is: it is a function that takes a severity, and returns a distribution of heights.

We can compute the marginal probabilities “by hand” by looking at the weights above:

If symptoms are severe, there is a 100% chance that the person is tall.
If symptoms are moderate, 935 study members are tall for every 504 medium-height members.
If symptoms are mild, then that’s 210 medium people for every 357 short.

We could implement that easily enough; it’s just another function like we’ve seen many times before in this series:

IDiscreteDistribution<Height> posterior(Severity s)
{
switch(s) … blah blah blah …

But I don’t want to have a human analyze the data and write the code. We have enough information in the IDiscreteDistribution<(Height, Severity)> to generate a Func<Severity<IDiscreteDistribution>.

In fact, we can simply add another clause to our query:

IDiscreteDistribution<Height> posterior(Severity s) =>
from pair in joint
where pair.s == s
select pair.h;

We can compute the posterior with a Where clause!

Recall that what we are computing here is logically P(A&B)/P(B); just as SelectMany can be thought of as a sort of multiplication, apparently Where is logically a sort of division.

But let’s not stop here; we can make a general rule in the form of an extension method, and I’m going to slap a projection onto the back side of it just for added generality because why not:

public static Func<B, IDiscreteDistribution<C>> Posterior<A, B, C>(
  this IDiscreteDistribution<A> prior,
  Func<A, IDiscreteDistribution<B>> likelihood,
  Func<A, B, C> projection) =>
b => from a in prior
from bb in likelihood(a)
where object.Equals(b, bb)
select projection(a, b);
public static Func<B, IDiscreteDistribution<A>> Posterior<A, B>(
this IDiscreteDistribution<A> prior,
Func<A, IDiscreteDistribution<B>> likelihood) =>
Posterior(prior, likelihood, (a, b) => a);

Let’s take it for a spin.

Question: Given the prior distribution and the likelihood function, what is the posterior distribution of height amongst the study members with moderate symptoms?

var posterior = prior.Posterior(likelihood);
Console.WriteLine(posterior(Moderate).ShowWeights());

And sure enough, we get a probability distribution out that matches what we could have computed by hand:

Tall:935
Medium:504

OK, that’s pretty neat, but why is this relevant?

Because Bayesian inference is incredibly important, and incredibly easy to get wrong! Anything we can do to improve developers’ ability to use Bayesian analysis correctly is a win.

Let’s look at another example. Almost a decade ago I did a blog post where I discussed how Bayesian inference is counterintuitive. Let’s run the numbers from that blog post through our system and see what we get.

We have a disease

enum TappetsDisease { Sick, Healthy }

and our prior is that 99% of the population is healthy:

var prior = new List<TappetsDisease> { Sick, Healthy }
.ToWeighted(1, 99);

We also have a test:

enum JethroTest { Positive, Negative }

And the test is 99% accurate. That is, if you are sick, it has a 99% chance of “positive”, and if you are healthy, it has a 99% chance of “negative”:

var tests = new List<JethroTest> { Positive, Negative };
IDiscreteDistribution<JethroTest> likelihood(TappetsDisease d) =>
d == Sick ? tests.ToWeighted(99, 1) : tests.ToWeighted(1, 99);

Aside: You might wonder how we know that the test is 99% accurate, and how we know that 1% of the population has the disease, particularly given the counterintuitive result I’m about to discuss. That’s a great question and I’m not going to get into the details in this series of how in the real world medical practitioners evaluate the accuracy of a test or the prevalence of a condition. Let’s just suppose that we know these facts ahead of time; after all, that’s why the prior is called the prior.

Question: you have just tested positive; what is the probability that you have the disease?

Most people, and even many doctors, will say “the test is 99% accurate, you tested positive, therefore there is a 99% chance that you have the disease”. But that is not at all true; we can compute the true result very easily now:

var posterior = prior.Posterior(likelihood);
Console.WriteLine(posterior(Positive).ShowWeights());

And we get:

Sick:1
Healthy:1

It’s fifty-fifty.

Why?

If a result is confusing, always look at the joint distribution:

Console.WriteLine(prior.Joint(likelihood).ShowWeights());

(Sick, Positive):99
(Sick, Negative):1
(Healthy, Positive):99
(Healthy, Negative):9801

You tested positive. 99 out of every 10000 people are true positives, and 99 out of every 10000 people are false positives. We condition away the negatives, because you didn’t test negative, and what is left? 50% chance that you are positive, not 99%.

Aside: In this example if you test negative then you are not 99% likely to be negative; you are 99.99% likely to be negative! This is also counterintuitive to people.

Exercise: How good does the test have to be for you to have a 90% posterior probability of actually being positive given a positive result?

Bayesian inference is incredibly powerful and useful. We very frequently have good information on priors and likelihoods. We make observations of the world, and we need to figure out posteriors probabilities given those observations. I could list examples all day; a classic example in information technology is:

We can ask users to manually classify emails into spam and non-spam. That gives us a prior on P(Spam)
From that collection of spam and non-spam emails, we can find out which words are commonly found only in spam. That gives us a likelihood function, P(Words|Spam).
We then make an observation of a real email, and the question is: given the words in an email, what is the posterior probability that it is spam? That is, what is the function P(Spam|Words). If the probability passes some threshold, we can put the mail in the spam folder.

There are also real applications in sensor technology:

We have a machine in a factory which requires a part on a conveyor to stop moving before it is welded; we manually observe how often the part is stopped correctly, giving us a prior on P(Stopped)
We install a sensor that attempts to sense whether the part is stopped, and test its accuracy to obtain P(SensorReading|Stopped)
Now we have enough information to compute the posterior: given a certain reading from the sensor, what is the probability that the part has actually stopped moving? That is P(Stopped|SensorReading)
If we do not have a high enough probability that the part is actually stopped, we can delay the welding step until we have better evidence that the part has stopped.

There are even applications in developer tools!

We can gather information from open source repositories about how often certain functions are called, giving us a prior on P(Function called)
We can gather information from IDE keystrokes about how often a letter typed is ultimately the first letter of that function, giving us a likelihood function P(Keystrokes|Function called)
Now we have enough information to compute the posterior: given a certain set of recent keystrokes, what is the probability distribution on likely functions the user wishes to call? This could give us much better IntelliSense.

And so on. The opportunities for taking advantage of Bayesian inference are enormous. We really ought to have Bayesian inference on distributions in the basic toolbox of the language, the same way we have ints, doubles, strings, nullables, functions, tasks, sequences, and so on, in that toolbox.

That’s what I mean by “Fixing Random”. The fundamental problem is not that Random has historically had a candy-machine interface; that’s just a silly historical accident that can be fixed. Rather: we’ve decided that monads like nullable, sequence, function and task are so important that they are included in the core runtime. Why? Not because they’re cool, but because having Nullable<T>, IEnumerable<T>, Task<T>, and so on in the core runtime makes it much easier for developers to write correct, concise code that solves their problems.

Programming is increasingly about dealing with a world of unknowns; having operators in the language for concisely describing probabilistic workflows seems very valuable to me. This series seeks to make the case for that value.

Next time on FAIC: We’ll take a closer look at the discrete probability distribution type as a monad. We might be missing a concept.

7 thoughts on “Fixing Random, part 16”

sollyucko on March 25, 2019 at 6:40 am said:

I personally find it much easier to think about it in terms of entropy as opposed to probability.

Reply ↓
- ericlippert on March 25, 2019 at 7:10 am said:
  
  I’m not going to make that connection explicitly in this series, though I think I have already mentioned that using the log of probability as the weight is a nice way to do it. There’s a natural connection to the entropy when you do it that way.
  
  Reply ↓
Семен Холоднов on March 25, 2019 at 10:09 am said:

> things equal to the same are equal to each other.
…unless they’re real, heh. Also, see: L. Carroll, “What the Tortoise Said to Achilles”.

Now that that joke is out of the way, the passage on how “SelectMany” can be thought of as a sort of multiplication, and “Where” as a sort of division made think of relational division for some reason. Given that probabilitiy theory can be seen as a sort-of extension of Boolean algebra, I wonder what a similar extension of predicate logic (or theory of relations) would look like.

Reply ↓
Pingback: Dew Drop – March 26, 2019 (#2926) | Morning Dew
Jacob Bundgaard on April 13, 2019 at 1:24 pm said:

Isn’t the end of the sentence “But I don’t want to have a human analyze the data and write the code. We have enough information in the IDiscreteDistribution to generate a Func<Severity.” missing something?

Reply ↓
Pingback: Fixing Random, bonus episode 2: pigeons and the noisy-or distribution | Fabulous adventures in coding
jmblakl on December 23, 2019 at 8:24 am said:

This video just came out from 3Blue1Brown last night describing Bayes Theorem. His channel is good in general for Math enthusiasts, but this one in particular is good background on probability.

Reply ↓