Fixing Random, part 17

Before we get going on today’s episode of FAIC, you might want to refresh your memory of what an additive monad is; I wrote an episode of my monad series on this subject. Briefly, an additive monad is a monad where there is a “zero value”; like the number zero, “multiplying” by zero produces a zero, and “adding” a zero is an identity.

For example, the sequence monad, IEnumerable<T>, has a zero: the empty sequence. If we Select or SelectMany from the empty sequence — our analog of “multiplying” — we get an empty sequence.  If we concatenate an empty sequence onto another sequence — the sequence analog of “adding” — we get the original sequence.

All additive monads can have a Where function defined on them; if we wanted to implement Where for sequences and didn’t care about performance, we could implement it like this:

public static IEnumerable<T> Single<T>(T t)
{
  yield return t;
}
public static IEnumerable<T> Zero<T>()
{
  yield break;
}
// Non-standard Where:
public static IEnumerable<T> Where<T>(
    this IEnumerable<T> items,
    Func<T, bool> predicate) =>
  from a in items
  from b in predicate(a) ? Single(a) : Zero<T>()
  select b;

That’s slow and produces hideous collection pressure, but it works; our actual implementation of Where is just an optimization.

What about the converse? Our probability monad IDiscreteDistribution<T> has a Where function defined. We definitely have a Singleton<T> type. But our implementation of the distribution monad does not appear to have a zero value. It seems plausible that there should be a way to express Where on distributions as we did with the sequence monad: as a SelectMany that produces either the single or zero distributions based on the predicate.

Give that some thought, and then scroll down when you think you have it worked out.







Just as the zero of the sequence monad is the empty sequence, the zero of the distribution monad is the empty distribution. That is, the distribution with an empty support that throws every time it is sampled.
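To make that concrete, here is a minimal sketch of Where written as a bind that yields either a singleton or the zero. It is not the implementation from this series: it assumes a hypothetical stand-in for a distribution (a dictionary from value to relative weight, with the empty dictionary playing the role of the zero), and the names ZeroSketch, Bind and Filter are made up for illustration, standing in for SelectMany and Where.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in: a distribution is a map from value to relative weight.
// The empty map is the zero (empty) distribution.
static class ZeroSketch
{
    public static Dictionary<T, double> Single<T>(T t) =>
        new Dictionary<T, double> { [t] = 1.0 };

    public static Dictionary<T, double> Zero<T>() =>
        new Dictionary<T, double>();

    // Bind: weight(b) is the sum over a of weight(a) * P(b | a).
    // Binding onto the zero, or binding a function that returns the zero,
    // contributes nothing to the result.
    public static Dictionary<B, double> Bind<A, B>(
        this Dictionary<A, double> prior,
        Func<A, Dictionary<B, double>> likelihood)
    {
        var result = new Dictionary<B, double>();
        foreach (var a in prior)
        {
            var conditional = likelihood(a.Key);
            double total = conditional.Values.Sum();
            foreach (var b in conditional)
            {
                // soFar is 0.0 when b has not been seen yet.
                result.TryGetValue(b.Key, out double soFar);
                result[b.Key] = soFar + a.Value * b.Value / total;
            }
        }
        return result;
    }

    // Where, expressed as a bind that picks the singleton or the zero.
    public static Dictionary<T, double> Filter<T>(
        this Dictionary<T, double> items,
        Func<T, bool> predicate) =>
        items.Bind(a => predicate(a) ? Single(a) : Zero<T>());
}
```

Filtering the weights { 1: 1, 2: 2, 3: 3, 4: 2 } for even values leaves weights 2 and 2 on the values 2 and 4. A predicate that is never true leaves the empty dictionary, which is the zero. The weights are relative, not normalized.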

We never implemented this value because every distribution class we’ve created already throws when you try to create an empty distribution:

  • StandardDiscreteInteger throws if the range is empty.
  • Bernoulli and WeightedInteger both throw if you give them all zero weights.
  • In our current implementation a Where clause where the predicate is false for everything in the support of the underlying collection will eventually throw.
  • In our original implementation, a Where clause where the predicate is always false hangs when sampled, but does not throw.
  • Our implementation of Select throws if the support is empty.
  • And so on.

Exercise: We have learned the following facts:

  • The zero value of the discrete distribution monad is the empty distribution.
  • The joint distribution produced by SelectMany is the analog of multiplication of two distributions.
  • Concatenation is the “addition” of the sequence monad. (The two sequences have to be of the same element type.)

I think it is pretty clear that doing a SelectMany on an empty distribution has to produce an empty distribution. But we still have a mystery to solve: what is the addition operator on two discrete distributions? They have to be of the same element type. The addition operator has to have the property that adding zero to any distribution is an identity, but what does it mean to add together two non-zero distributions?

Answer in the comments with your thoughts.

It turns out that there are some uses for an explicit empty distribution; we’ll discover what the specific benefits of it are in a later episode.

What are the costs? I don’t mean implementation costs, but rather, what are the down sides to developers of having this feature? In short: if we go down this road, what new opportunities for bugs are we producing?

One interesting cost is that we will defer an operation that can throw; this can be very confusing! A classic source of StackOverflow questions is when someone writes an enumerator block:

static IEnumerable<int> Foo(string bar)
{
  if (bar == null)
    throw new ArgumentNullException();
  yield return bar.Length;
}

and then calls it:

var foo = Foo(accidentallyNullThing); // no throw
foreach (int x in foo) // throw!

The source of the problem is that the throw is delayed. If you look at the proper, industrial-strength implementations of Where, Select and so on, you’ll notice that each one is written in a style where it validates its arguments first, and then returns a call to a helper method that actually does the iteration. That way the exception is thrown close to the point of the mistake.
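A minimal sketch of that style, applied to the Foo example above. The class name EagerValidation and the helper name FooIterator are hypothetical; only the split into a validating method and an iterator helper matters.

```csharp
using System;
using System.Collections.Generic;

static class EagerValidation
{
    // Not an iterator block (no yield here), so the null check runs
    // as soon as Foo is called.
    public static IEnumerable<int> Foo(string bar)
    {
        if (bar == null)
            throw new ArgumentNullException(nameof(bar));
        return FooIterator(bar);
    }

    // The iterator block: its body runs lazily, on the first MoveNext.
    private static IEnumerable<int> FooIterator(string bar)
    {
        yield return bar.Length;
    }
}
```

With this shape, Foo(null) throws at the call site rather than inside the foreach that later consumes the sequence.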

However, that doesn’t fix other common variations on the problem. For example, you might have some buggy code that produces an empty sequence sometimes, and then a thousand lines later you call First on the sequence and it blows up, but the bug is where the sequence is produced.

And of course this is really no different than nullable types that blow up when we forget that they can be null; a nullable T is logically a sequence of T where the sequence length is either zero or one, and if we forget that it can be “zero length”, we get into trouble.

The empty distribution will have the same property: it will be easy to create it by accident in a buggy program and it will not blow up until it is sampled, just as nullable reference types do not blow up until they are dereferenced.

That said, we’re going to do it because the benefits are actually pretty compelling, oddly enough.

Next time on FAIC: In the next regularly-scheduled episode we will implement the empty distribution; it’ll be quite straightforward, but it will necessitate fixing up some of our existing code. However, before then I’m going to interrupt this series with a very special episode that addresses a long-standing puzzler in probability theory which I just realized we now have all the gear we need to answer. Stay tuned!



Fixing Random, part 16

[Code is here.]

This series is getting quite long and we’re not done yet! This would be a good time to quickly review where we’re at:

  • We’re representing a particular discrete probability distribution P(A) over a small number of members of a particular type A by IDiscreteDistribution<A>.
  • We can condition a distribution — by discarding certain possibilities from it — with Where.
  • We can project a distribution from one type to another with Select.
  • A conditional probability P(B|A) — the probability of B given that some A is true — is represented as a likelihood function of type Func<A, IDiscreteDistribution<B>>.
  • We can “bind” a likelihood function onto a prior distribution with SelectMany to produce a joint distribution.

These are all good results, and I hope you agree that we have already produced a much richer and more powerful abstraction over randomness than System.Random provides. But in today’s episode everything is really going to come together to reveal that we can use these tools to solve interesting problems in probabilistic inference.

To show how, we’ll need to start by reviewing Bayes’ Theorem.

If we have a prior P(A), and a likelihood P(B|A), we know that we can “bind” them together to form the joint distribution. That is, the probability of A and B both happening is the probability of A multiplied by the probability that B happens given that A has happened:

P(A&B) = P(A) P(B|A)

Obviously that goes the other way. If we have P(B) as our prior, and P(A|B) as our likelihood, then:

P(B&A) = P(B) P(A|B)

But A&B is the same as B&A, and things equal to the same thing are equal to each other. Therefore:

P(A) P(B|A) = P(B) P(A|B)

Let’s suppose that P(A) is our prior and P(B|A) is our likelihood. In the equation above the term P(A|B) is called the posterior and can be computed like this:

P(A|B) = P(A) P(B|A) / P(B)

I hope that is clear, but let’s move away from the abstract mathematics and illustrate an example by using the code we’ve written so far.

We can step back a few episodes and re-examine our prior and likelihood example for Frob Syndrome. Recall that this was a made-up study of a made-up condition which we believe may be linked to height. We’ll use the weights from the original episode.

That is to say: we have P(Height), we have likelihood function P(Severity|Height), and we wish to first compute the joint probability distribution P(Height&Severity):

var heights = new List<Height>() { Tall, Medium, Short };
var prior = heights.ToWeighted(5, 2, 1);
IDiscreteDistribution<Severity> likelihood(Height h)
{
  switch(h)
  {
    case Tall: return severity.ToWeighted(10, 11, 0);
    case Medium: return severity.ToWeighted(0, 12, 5);
    default: return severity.ToWeighted(0, 0, 1);
  }
}
var joint = prior.Joint(likelihood);
Console.WriteLine(joint.ShowWeights());

This produces:

(Tall, Severe):850
(Tall, Moderate):935
(Medium, Moderate):504
(Medium, Mild):210
(Short, Mild):357

Now the question is: what is the posterior, P(Height|Severity)? Remember what this is:  it is a function that takes a severity, and returns a distribution of heights.

We can compute the marginal probabilities “by hand” by looking at the weights above:

  • If symptoms are severe, there is a 100% chance that the person is tall.
  • If symptoms are moderate, 935 study members are tall for every 504 medium-height members.
  • If symptoms are mild, then that’s 210 medium people for every 357 short.
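
As a check on those ratios, the joint weights can be normalized into probabilities. This is plain arithmetic on the numbers already shown, with decimals rounded to three places:

```latex
\begin{aligned}
P(\text{Tall} \mid \text{Severe}) &= \frac{850}{850} = 1 \\
P(\text{Tall} \mid \text{Moderate}) &= \frac{935}{935 + 504} = \frac{935}{1439} \approx 0.650 \\
P(\text{Medium} \mid \text{Moderate}) &= \frac{504}{935 + 504} = \frac{504}{1439} \approx 0.350 \\
P(\text{Medium} \mid \text{Mild}) &= \frac{210}{210 + 357} = \frac{10}{27} \approx 0.370 \\
P(\text{Short} \mid \text{Mild}) &= \frac{357}{210 + 357} = \frac{17}{27} \approx 0.630
\end{aligned}
```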

We could implement that easily enough; it’s just another function like we’ve seen many times before in this series:

IDiscreteDistribution<Height> posterior(Severity s)
  switch(s) … blah blah blah …

But I don’t want to have a human analyze the data and write the code. We have enough information in the IDiscreteDistribution<(Height, Severity)> to generate a Func<Severity, IDiscreteDistribution<Height>>.

In fact, we can simply add another clause to our query:

IDiscreteDistribution<Height> posterior(Severity s) => 
  from pair in joint
  where pair.s == s
  select pair.h;

We can compute the posterior with a Where clause!

Recall that what we are computing here is logically P(A&B)/P(B); just as SelectMany can be thought of as a sort of multiplication, apparently Where is logically a sort of division.

But let’s not stop here; we can make a general rule in the form of an extension method, and I’m going to slap a projection onto the back side of it just for added generality because why not:

public static Func<B, IDiscreteDistribution<C>> Posterior<A, B, C>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood,
    Func<A, B, C> projection) =>
  b => from a in prior
       from bb in likelihood(a)
       where object.Equals(b, bb)
       select projection(a, b);
public static Func<B, IDiscreteDistribution<A>> Posterior<A, B>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood) =>
  Posterior(prior, likelihood, (a, b) => a);

Let’s take it for a spin.

Question: Given the prior distribution and the likelihood function, what is the posterior distribution of height amongst the study members with moderate symptoms?

var posterior = prior.Posterior(likelihood);
Console.WriteLine(posterior(Moderate).ShowWeights());

And sure enough, we get a probability distribution out that matches what we could have computed by hand: 935 parts Tall to 504 parts Medium.


OK, that’s pretty neat, but why is this relevant?

Because Bayesian inference is incredibly important, and incredibly easy to get wrong! Anything we can do to improve developers’ ability to use Bayesian analysis correctly is a win.

Let’s look at another example. Almost a decade ago I did a blog post where I discussed how Bayesian inference is counterintuitive. Let’s run the numbers from that blog post through our system and see what we get.

We have a disease

enum TappetsDisease { Sick, Healthy }

and our prior is that 99% of the population is healthy:

var prior = new List<TappetsDisease> { Sick, Healthy }
  .ToWeighted(1, 99);

We also have a test:

enum JethroTest { Positive, Negative }

And the test is 99% accurate. That is, if you are sick, it has a 99% chance of “positive”, and if you are healthy, it has a 99% chance of “negative”:

var tests = new List<JethroTest> { Positive, Negative };
IDiscreteDistribution<JethroTest> likelihood(TappetsDisease d) =>
  d == Sick ? tests.ToWeighted(99, 1) : tests.ToWeighted(1, 99);

Aside: You might wonder how we know that the test is 99% accurate, and how we know that 1% of the population has the disease, particularly given the counterintuitive result I’m about to discuss. That’s a great question and I’m not going to get into the details in this series of how in the real world medical practitioners evaluate the accuracy of a test or the prevalence of a condition. Let’s just suppose that we know these facts ahead of time; after all, that’s why the prior is called the prior.

Question: you have just tested positive; what is the probability that you have the disease?

Most people, and even many doctors, will say “the test is 99% accurate, you tested positive, therefore there is a 99% chance that you have the disease”. But that is not at all true; we can compute the true result very easily now:

var posterior = prior.Posterior(likelihood);
Console.WriteLine(posterior(Positive).ShowWeights());

And we get equal weights for Sick and Healthy.

It’s fifty-fifty.


If a result is confusing, always look at the joint distribution:


(Sick, Positive):99
(Sick, Negative):1
(Healthy, Positive):99
(Healthy, Negative):9801

You tested positive. 99 out of every 10000 people are true positives, and 99 out of every 10000 people are false positives. We condition away the negatives, because you didn’t test negative, and what is left? 50% chance that you are positive, not 99%.
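
The same numbers can be reproduced without any of the types from this series. The following is a minimal sketch, not the series’ implementation: it assumes a distribution can be stood in for by a dictionary of integer weights, it assumes every conditional distribution has the same total weight (100 here) so that a plain product of weights is correct, and the names PosteriorSketch, Joint, Posterior and TappetsExample are hypothetical.

```csharp
using System;
using System.Collections.Generic;

static class PosteriorSketch
{
    // Joint weight of (a, b) is weight(a) * weight(b given a).
    // Only valid as written when all conditionals share one total weight.
    public static Dictionary<(A, B), int> Joint<A, B>(
        Dictionary<A, int> prior,
        Func<A, Dictionary<B, int>> likelihood)
    {
        var joint = new Dictionary<(A, B), int>();
        foreach (var a in prior)
            foreach (var b in likelihood(a.Key))
                joint[(a.Key, b.Key)] = a.Value * b.Value;
        return joint;
    }

    // Posterior: keep the joint entries that match the observation,
    // then project each surviving pair down to its first component.
    public static Dictionary<A, int> Posterior<A, B>(
        Dictionary<(A, B), int> joint, B observed)
    {
        var posterior = new Dictionary<A, int>();
        foreach (var entry in joint)
            if (EqualityComparer<B>.Default.Equals(entry.Key.Item2, observed))
                posterior[entry.Key.Item1] = entry.Value;
        return posterior;
    }

    // The disease example: 1% prevalence, 99% accurate test, positive result.
    public static Dictionary<string, int> TappetsExample()
    {
        var prior = new Dictionary<string, int> { ["Sick"] = 1, ["Healthy"] = 99 };
        Func<string, Dictionary<string, int>> likelihood = d =>
            d == "Sick"
                ? new Dictionary<string, int> { ["Positive"] = 99, ["Negative"] = 1 }
                : new Dictionary<string, int> { ["Positive"] = 1, ["Negative"] = 99 };
        return Posterior(Joint(prior, likelihood), "Positive");
    }
}
```

TappetsExample returns weight 99 for Sick and weight 99 for Healthy, which is the fifty-fifty result. Conditioning on "Negative" instead leaves weights 1 and 9801.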

Aside: In this example if you test negative then you are not 99% likely to be negative; you are 99.99% likely to be negative! This is also counterintuitive to people.

Exercise: How good does the test have to be for you to have a 90% posterior probability of actually being positive given a positive result?

Bayesian inference is incredibly powerful and useful. We very frequently have good information on priors and likelihoods. We make observations of the world, and we need to figure out posterior probabilities given those observations. I could list examples all day; a classic example in information technology is:

  • We can ask users to manually classify emails into spam and non-spam. That gives us a prior on P(Spam)
  • From that collection of spam and non-spam emails, we can find out which words are commonly found only in spam. That gives us a likelihood function, P(Words|Spam).
  • We then make an observation of a real email, and the question is: given the words in an email, what is the posterior probability that it is spam? That is, what is the function P(Spam|Words)? If the probability passes some threshold, we can put the mail in the spam folder.

There are also real applications in sensor technology:

  • We have a machine in a factory which requires a part on a conveyor to stop moving before it is welded; we manually observe how often the part is stopped correctly, giving us a prior on P(Stopped)
  • We install a sensor that attempts to sense whether the part is stopped, and test its accuracy to obtain P(SensorReading|Stopped)
  • Now we have enough information to compute the posterior: given a certain reading from the sensor, what is the probability that the part has actually stopped moving? That is P(Stopped|SensorReading)
  • If we do not have a high enough probability that the part is actually stopped, we can delay the welding step until we have better evidence that the part has stopped.

There are even applications in developer tools!

  • We can gather information from open source repositories about how often certain functions are called, giving us a prior on P(Function called)
  • We can gather information from IDE keystrokes about how often a letter typed is ultimately the first letter of that function, giving us a likelihood function P(Keystrokes|Function called)
  • Now we have enough information to compute the posterior: given a certain set of recent keystrokes, what is the probability distribution on likely functions the user wishes to call? This could give us much better IntelliSense.

And so on. The opportunities for taking advantage of Bayesian inference are enormous. We really ought to have Bayesian inference on distributions in the basic toolbox of the language, the same way we have ints, doubles, strings, nullables, functions, tasks, sequences, and so on, in that toolbox.

That’s what I mean by “Fixing Random”. The fundamental problem is not that Random has historically had a candy-machine interface; that’s just a silly historical accident that can be fixed. Rather: we’ve decided that monads like nullable, sequence, function and task are so important that they are included in the core runtime. Why? Not because they’re cool, but because having Nullable<T>, IEnumerable<T>,  Task<T>, and so on in the core runtime makes it much easier for developers to write correct, concise code that solves their problems.

Programming is increasingly about dealing with a world of unknowns; having operators in the language for concisely describing probabilistic workflows seems very valuable to me. This series seeks to make the case for that value.

Next time on FAIC: We’ll take a closer look at the discrete probability distribution type as a monad. We might be missing a concept.

Fixing Random, part 15

[Code is here.]

Last time on FAIC we made a correct, efficient implementation of SelectMany to bind a likelihood function and projection onto a prior, and gave a simple example. I deliberately chose “weird” numbers for all the weights; let’s do that same example again but with more “nice round number” weights:

var prior = new List<Height>() { Tall, Medium, Short }
  .ToWeighted(60, 30, 10);
IDiscreteDistribution<Severity> likelihood(Height h)
{
  switch(h)
  {
    case Tall: return severity.ToWeighted(45, 55, 0);
    case Medium: return severity.ToWeighted(0, 70, 30);
    default: return severity.ToWeighted(0, 0, 1);
  }
}
[… projection as before…]
.SelectMany(likelihood, projection)

This produces the output:


which is correct, but you notice how multiplying the weights during the SelectMany made for some unnecessarily large weights. If we then did another SelectMany on this thing, they’d get even larger, and we’d be getting into integer overflow territory.
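
To see where the large numbers come from, take prior weights 60, 30, 10 and likelihood weights (45, 55, 0), (0, 70, 30), (0, 0, 1), and apply the arithmetic from the previous episode. The following is a hand calculation under that reading of the weights, not captured program output:

```latex
\begin{aligned}
\text{product of likelihood totals} &= 100 \cdot 100 \cdot 1 = 10000 \\
(\text{Tall}, \text{Severe}) \to \text{DoubleDose} &: 60 \cdot 45 \cdot \tfrac{10000}{100} = 270000 \\
(\text{Tall}, \text{Moderate}) \to \text{NormalDose} &: 60 \cdot 55 \cdot \tfrac{10000}{100} = 330000 \\
(\text{Medium}, \text{Moderate}) \to \text{NormalDose} &: 30 \cdot 70 \cdot \tfrac{10000}{100} = 210000 \\
(\text{Medium}, \text{Mild}) \to \text{HalfDose} &: 30 \cdot 30 \cdot \tfrac{10000}{100} = 90000 \\
(\text{Short}, \text{Mild}) \to \text{HalfDose} &: 10 \cdot 1 \cdot \tfrac{10000}{1} = 100000 \\
\text{DoubleDose} : \text{NormalDose} : \text{HalfDose} &= 270000 : 540000 : 190000 = 27 : 54 : 19
\end{aligned}
```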

Integer overflow is always possible in the system I’ve developed so far in this series, and I am deliberately glossing over this serious problem. A better implementation would either use doubles for weights, which have a much larger range, or arbitrary-precision integers, or arbitrary-precision rationals. I’m using integers to keep it simple, but as with many aspects of the code in this series, that would become problematic in a realistic implementation.

One thing we can do to tame this slightly is to reduce all the weights when possible; plainly in this case we could divide each of them by 10000 and have exactly the same distribution, so let’s do that. And just to make sure, I’m going to mitigate the problem in multiple places:

  • In SelectMany we could be taking the least common multiple (LCM) instead of the full product of the weights.
  • In the WeightedInteger factory we could be dividing out all the weights by their greatest common divisor (GCD).

Long-time readers of my blog may recall that I’ve implemented Euclid’s Algorithm before, but this time I’m going to make a much simpler implementation:

public static int GCD(int a, int b) => 
  b == 0 ? a : GCD(b, a % b);

We define the GCD of two non-negative integers a and b as:

  • if both zero, then zero
  • otherwise, if exactly one is zero, then the non-zero one
  • otherwise, the largest integer that divides both.

Exercise: Prove that this recursive implementation meets the above contract.

The problem we face though is that we have many weights and we wish to find the GCD of all of them. Fortunately, we can simply do an aggregation:

public static int GCD(this IEnumerable<int> numbers) =>
  numbers.Aggregate(GCD);

Similarly we can compute the LCM if we know the GCD:

public static int LCM(int a, int b) =>
  a * b / GCD(a, b);
public static int LCM(this IEnumerable<int> numbers) =>
numbers.Aggregate(1, LCM);
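
Here is a self-contained sketch of these helpers together with the weight reduction they enable. It makes a few choices that are assumptions of the sketch rather than the code from the repository: the class name WeightMath and the Reduce method are hypothetical, the GCD aggregation is seeded with 0 so that an empty sequence yields 0 rather than throwing, and LCM divides before multiplying to keep the intermediate value smaller.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class WeightMath
{
    // Euclid's algorithm on non-negative integers.
    public static int GCD(int a, int b) => b == 0 ? a : GCD(b, a % b);

    // Seeding with 0 works because GCD(0, x) == x.
    public static int GCD(IEnumerable<int> numbers) =>
        numbers.Aggregate(0, (a, b) => GCD(a, b));

    // Divide first so the intermediate product stays small.
    // Assumes a and b are not both zero.
    public static int LCM(int a, int b) => a / GCD(a, b) * b;

    public static int LCM(IEnumerable<int> numbers) =>
        numbers.Aggregate(1, (a, b) => LCM(a, b));

    // Divide every weight by the common GCD; proportions are unchanged.
    public static List<int> Reduce(IEnumerable<int> weights)
    {
        List<int> w = weights.ToList();
        int gcd = GCD(w);
        if (gcd == 0)
            throw new ArgumentException("At least one weight must be non-zero.");
        return w.Select(x => x / gcd).ToList();
    }
}
```

For the illustrative weights 270000, 540000 and 190000, Reduce gives 27, 54 and 19. For likelihood totals of 100, 100 and 1, the LCM is 100 where the full product would be 10000.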

And now we can modify our WeightedInteger factory:

public static IDiscreteDistribution<int> Distribution(
  IEnumerable<int> weights)
{
  List<int> w = weights.ToList();
  int gcd = w.GCD();
  for (int i = 0; i < w.Count; i += 1)
    w[i] /= gcd;
  [… and then build the distribution from w as before …]
}

And our SelectMany:

int lcm = prior.Support()
  .Select(a => likelihood(a).TotalWeight())
  .LCM();
[… and then use the lcm in the query …]

See the code repository for all the details. If we apply all these changes then our results look much better…


… and we are at least a little less likely to get into an integer overflow situation.

Aside: Of course we can do the same thing to the Bernoulli class, and normalize its weights as well.

Next time on FAIC: We can use the gear we’ve created so far to solve problems in Bayesian inference; we’ll see how.

So long, MSDN blog

UPDATE 3: Rock stars Scott Hanselman and Dan Fernandez and their colleagues have gotten my MSDN blog back up, and will also restore the late cbrumme’s blog as well. Thank you both, and everyone else at what I can only assume is the Microsoft Content Migration Disaster Mitigation Team for your prompt attention. I very much appreciate it.

I’m still going to migrate all my content over here though. 🙂

UPDATE 2: The awesome Scott Hanselman informs me that there has been a “hiccup” during migration, and that the intention was to archive the MSDN blogs in a read-only format with the same links; they should be back soon.

UPDATE 1: I see this has been linked from HackerNews; welcome, new readers. Normally this blog is not me complaining about Microsoft corporate decision making blunders. I’m currently on part 15 of a series on basic probabilistic programming in C#, so if that interests you, stick around!

For reasons unknown to me, my MSDN blog has been deleted without warning. (Microsoft, I would have appreciated a heads-up. It’s not like you don’t know how to reach me!)

This is unfortunate, since there are literally thousands of links to it spread over the internet that are now dead. And there was a lot of good historical content there. This is very disappointing.

Fortunately I have a backup of all the text, and the graphs and images can be recreated.

I’ve started putting up the old content here, but it will take some time to get it all formatted correctly and whatnot. So far I’ve made it through September 12, 2003, so one day down, many hundreds more to go.

Apparently all the old MSDN blogs are being taken down, which is a great loss. I relied upon old blogs like the late, great cbrumme’s blog to archive the early design decisions for .NET, and there are many others that will be missed.



Fixing Random, part 14

[Code is here.]

Last time on FAIC we achieved two major results in our effort to build better probability tools. First, we demonstrated that the SelectMany implementation which applies a likelihood function to a prior probability is the bind operation of the probability monad. Second, we gave an implementation of a wrapper object that implements it. Its action can be summed up as:

  • sample from the prior distribution
  • use the likelihood function to get the conditional distribution
  • sample from the conditional distribution
  • run the projection on the pair of samples to get the result
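
Those four steps can be sketched in isolation. This is a simplified illustration, not the wrapper class from the previous episode: it assumes a distribution is just a list of (value, weight) pairs with non-negative integer weights, and the names CombinedSketch, Sample and SampleCombined are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CombinedSketch
{
    private static readonly Random random = new Random();

    // Weighted pick from (value, weight) pairs.
    public static T Sample<T>(IReadOnlyList<(T value, int weight)> dist)
    {
        int total = dist.Sum(p => p.weight);
        if (total <= 0)
            throw new InvalidOperationException("Cannot sample an empty distribution.");
        int r = random.Next(total); // uniform in [0, total)
        foreach (var (value, weight) in dist)
        {
            if (r < weight)
                return value;
            r -= weight;
        }
        throw new InvalidOperationException("Unreachable when weights are non-negative.");
    }

    public static C SampleCombined<A, B, C>(
        IReadOnlyList<(A value, int weight)> prior,
        Func<A, IReadOnlyList<(B value, int weight)>> likelihood,
        Func<A, B, C> projection)
    {
        A a = Sample(prior);              // 1. sample from the prior
        var conditional = likelihood(a);  // 2. get the conditional distribution
        B b = Sample(conditional);        // 3. sample from the conditional
        return projection(a, b);          // 4. project the pair of samples
    }
}
```

Sampling this way is easy; the hard part, as the next paragraph explains, is computing the weight of each possible result.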

You probably recall though that I did not implement the Weight function. It’s a little tricky to do so, for two reasons. First, I made the (now somewhat questionable!) decision to make weights integers. If the weights are fractions between 0.0 and 1.0, you can just multiply the weight of the prior sample by the weight of the conditional sample. (And if the weights are logarithms, you can just add them.) It’s trickier with integers. And second, the projection at the end introduces once again the possibility that there will be “collisions”; the projection could pick non-unique values for unique combinations of the samples, that then have to be weighted as the sum.

That’s all a little abstract, so let’s work an example.

Suppose we have a population of people who have been diagnosed with Frob Syndrome, which seems to be linked with height. We’ll divide the population of Frob Syndrome patients into three categories:

enum Height { Tall, Medium, Short }

and let’s suppose in our study population there are five tall people, two medium-height people, and one short person in every eight:

var prior = new List<Height>() { Tall, Medium, Short }
  .ToWeighted(5, 2, 1);

(To keep the code short on the page, suppose I have using static directives for each.)

Now let’s suppose we’ve done a survey of each of the tall, medium and short people to learn the severity of their symptoms:

enum Severity { Severe, Moderate, Mild }

At this point I’m going to make the numbers a bit odd to illustrate the mathematics more clearly.  What is the likelihood of a member of each group to report symptoms? Let’s say that 10 out of every 21 tall people report severe symptoms, and the remaining 11 report moderate symptoms. For the medium-height people, 12 out of 17 report moderate symptoms and 5 report mild symptoms. And all the short people report mild symptoms:

var severity = new List<Severity> { Severe, Moderate, Mild };

IDiscreteDistribution<Severity> likelihood(Height h)
{
  switch(h)
  {
    case Tall: return severity.ToWeighted(10, 11, 0);
    case Medium: return severity.ToWeighted(0, 12, 5);
    default: return severity.ToWeighted(0, 0, 1);
  }
}

And now let’s suppose we have a recommended prescription level:

enum Prescription { DoubleDose, NormalDose, HalfDose }

Taller people or people with more severe symptoms get a higher dose; shorter people or people with mild symptoms get a smaller dose:

Prescription projection(Height h, Severity s)
{
  switch (h)
  {
    case Tall: return s == Severe ? DoubleDose : NormalDose;
    case Medium: return s == Mild ? HalfDose : NormalDose;
    default: return HalfDose;
  }
}

The question now is: what is the probability distribution on prescriptions for this study population?  That is, if we picked a random member of this population, how likely is it that they’d have a double, normal or half dose prescription?

IDiscreteDistribution<Prescription> doses =
  prior.SelectMany(likelihood, projection);

The problem is to work out the weightings of the three possible outcomes.

As I mentioned before, it’s easiest to do this when the weights are fractions because we can then just multiply them and then add them up:

Height        Severity           Prescription
Tall   (5/8)  Severe   (10/21)   DoubleDose (25/84)
Tall   (5/8)  Moderate (11/21)   NormalDose (55/168)
Medium (2/8)  Moderate (12/17)   NormalDose  (3/17)
Medium (2/8)  Mild      (5/17)   HalfDose    (5/68)
Short  (1/8)  Mild      (1/1)    HalfDose    (1/8)

(To save space I’ve elided the zero rows.)

So the probability of a member of this population getting a double dose is 25/84, getting a normal dose is 55/168 + 3/17 = 1439/2856, and getting a half dose is 5/68 + 1/8 = 27/136. Verifying that those add up to 1.0 is left as an exercise.

But we’re going to grit our teeth here and do it all in integers! How might we do that?

Well, we know how to eliminate fractions: multiply all the weights in the first column by 8, and all the weights in the second column by 21 * 17, and none of the proportions will change:

Height      Severity         Prescription
Tall   (5)  Severe   (170)   DoubleDose (850)
Tall   (5)  Moderate (187)   NormalDose (935)
Medium (2)  Moderate (252)   NormalDose (504)
Medium (2)  Mild     (105)   HalfDose   (210)
Short  (1)  Mild     (357)   HalfDose   (357)

So the integer weights are: double dose is 850, normal dose is 935 + 504 = 1439, and half dose is 210 + 357 = 567.

Let’s implement it!

First off, oddly enough there is a Sum() extension method but no Product() extension method, so let’s implement that:

public static int Product(this IEnumerable<int> items) =>
  items.Aggregate(1, (a, b) => a * b);

And I also need to know the total weight of a distribution:

public static int TotalWeight<T>(
    this IDiscreteDistribution<T> d) =>
  d.Support().Select(t => d.Weight(t)).Sum();

And now we can implement the algorithm I just sketched out:

int product = prior.Support()
  .Select(a => likelihood(a).TotalWeight())
  .Product();
var w = from h in prior.Support()
        let ps = likelihood(h)
        from s in ps.Support()
        group prior.Weight(h) * ps.Weight(s) *
              product / ps.TotalWeight()
        by projection(h, s);
var dict = w.ToDictionary(g => g.Key, g => g.Sum());
var doses = dict.Keys.ToList();
var weights = dict.Values.ToList();

And sure enough, if we print those last two out:

DoubleDose, NormalDose, HalfDose
850, 1439, 567
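
For readers who want to check the arithmetic without the rest of the types from this series, here is a self-contained sketch of the same computation. It assumes plain dictionaries of integer weights in place of IDiscreteDistribution<T>, and the class name DoseWeights and its members are hypothetical; the weights and the projection are the ones from the example above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

enum Height { Tall, Medium, Short }
enum Severity { Severe, Moderate, Mild }
enum Prescription { DoubleDose, NormalDose, HalfDose }

static class DoseWeights
{
    static readonly Dictionary<Height, int> Prior = new Dictionary<Height, int>
    {
        [Height.Tall] = 5, [Height.Medium] = 2, [Height.Short] = 1
    };

    // Zero-weight outcomes are simply left out of each dictionary.
    static Dictionary<Severity, int> Likelihood(Height h)
    {
        switch (h)
        {
            case Height.Tall:
                return new Dictionary<Severity, int>
                    { [Severity.Severe] = 10, [Severity.Moderate] = 11 };
            case Height.Medium:
                return new Dictionary<Severity, int>
                    { [Severity.Moderate] = 12, [Severity.Mild] = 5 };
            default:
                return new Dictionary<Severity, int> { [Severity.Mild] = 1 };
        }
    }

    static Prescription Projection(Height h, Severity s)
    {
        switch (h)
        {
            case Height.Tall:
                return s == Severity.Severe
                    ? Prescription.DoubleDose : Prescription.NormalDose;
            case Height.Medium:
                return s == Severity.Mild
                    ? Prescription.HalfDose : Prescription.NormalDose;
            default:
                return Prescription.HalfDose;
        }
    }

    public static Dictionary<Prescription, int> Compute()
    {
        // Product of the likelihood totals: 21 * 17 * 1 = 357.
        int product = Prior.Keys
            .Select(h => Likelihood(h).Values.Sum())
            .Aggregate(1, (a, b) => a * b);
        var w = from h in Prior.Keys
                let ps = Likelihood(h)
                let total = ps.Values.Sum()
                from s in ps.Keys
                group Prior[h] * ps[s] * product / total
                by Projection(h, s);
        // Colliding projections are summed.
        return w.ToDictionary(g => g.Key, g => g.Sum());
    }
}
```

Compute returns 850 for DoubleDose, 1439 for NormalDose and 567 for HalfDose, matching the table worked out by hand.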

Super, we can now work out the weights in our implementation of SelectMany.

But… wait a minute. Why do we have to?

That is, why do we need a Combined wrapper class for SelectMany at all?

We just worked out the weights of every member of the support, and we did so making no assumptions whatsoever about the prior or the likelihood function. We can delete our Combined wrapper class, and replace our implementation of SelectMany with:

public static IDiscreteDistribution<C> SelectMany<A, B, C>(
  this IDiscreteDistribution<A> prior,
  Func<A, IDiscreteDistribution<B>> likelihood,
  Func<A, B, C> projection)
{
  int product = prior.Support()
    .Select(a => likelihood(a).TotalWeight())
    .Product();
  var w = from a in prior.Support()
          let pb = likelihood(a)
          from b in pb.Support()
          group prior.Weight(a) * pb.Weight(b) *
            product / pb.TotalWeight()
          by projection(a, b);
  var dict = w.ToDictionary(g => g.Key, g => g.Sum());
  return dict.Keys.ToWeighted(dict.Values);
}

Exercise: Do you see any potential pitfalls in this implementation of computing the new weights? Give it some thought; I’ll give the answer in the next episode.

We do a small amount of math up front, and in exchange, we have computed the exact resulting probability distribution, which we can sample from efficiently, just as we did with Where and Select in previous episodes.

Aside: Once again, if you trace through all the logic I’ve written so far you will quickly see that it is hugely inefficient in terms of the amount of re-computation it does and garbage it produces. If we were writing this for production code, we’d be a lot more aggressive about finding code paths that do re-computation and eliminating them. The point of this exercise is that our code produces correct, efficient distribution objects out the other end, even if it is a bit wasteful to do so in this particular pedagogic implementation.

Think about the power of this: you can write programs that treat discrete probability distributions over arbitrary types as values, the same way you’d treat integers, strings, sequences, or whatever, as values that obey a particular set of algebraic rules. We can project, condition and combine them together with the same ease that we do today with sequences, and sample from them to boot!

The idea that we can describe a probabilistic workflow, and have as the output of that workflow a new distribution semantically equivalent to the effect of the workflow, but without any long-running sample-and-reject loops due to the conditions, is called inference by the probabilistic programming community.

We’ve seen that we can do inference on arbitrary discrete distributions provided that the supports are small and the weights are small integers; as we’ll see throughout the rest of this series, the problem gets considerably harder as we abandon some of those simplifying assumptions.

Next time on FAIC: I’m going to implement a minor optimization in our weighted integer distribution. After that, we’ll put it all together to show how what we’ve developed so far can be used for Bayesian inference.


Fixing Random, part 13

[Code is here.]

Last time on FAIC we discovered the interesting fact that conditional probabilities can be represented as likelihood functions, and that applying a conditional probability to a prior probability looks suspiciously like SelectMany, which is usually the bind operation on the sequence monad. We created a new implementation of SelectMany that creates an object which samples from the prior, calls the likelihood, and then samples from the resulting distribution. Is that the bind operation on the probability distribution monad?

Aside: If you’re completely confused by the preceding paragraph, you might want to read my gentle introduction to monads for OO programmers. Go ahead and read that over if it is not fresh in your mind.

We need the following things to have a monad in C#:

  • We need an “embarrassingly generic” type: some Foo<T> where it can sensibly take on any T whatsoever. IDiscreteDistribution<T> meets that condition.
  • The type represents an “amplification of power” of the underlying type. Indeed it does; it allows us to represent a probability distribution of particular values of that type, which is certainly a new power that we did not have before.
  • We need a way of taking any specific value of any T, and creating an instance of the monadic type that represents that specific value. Singleton<T>.Distribution(t) meets that condition.
  • There is frequently (but not necessarily) an operation that extracts a value of the underlying type from an instance of the monad. Sample() is that operation. Note that sampling a singleton always gives you back the original value.
  • There is a way to “bind” a new function onto an existing instance of the monad. That operation has the signature M<R> SelectMany<A, R>(M<A> m, Func<A, M<R>> f). We traditionally call it SelectMany in C# because that’s the bind operation on IEnumerable<T>, and it produces a projection on all the elements from a sequence of sequences. As we saw last time, we have this function for probability distributions.
  • Binding the “create a new instance” function to an existing monad must produce an equivalent monad. I think it is pretty clear that if we have an IDiscreteDistribution<T> in hand, call it d, then SelectMany(d, t => Singleton<T>.Distribution(t)) produces an object that has the same distribution that d does. If that’s not clear, play around with the code until it becomes clear to you.
  • Going “the other direction” must also work. That is, if we have a Func<A, IDiscreteDistribution<B>> called f, and a value of A, then SelectMany(Singleton<A>.Distribution(a), f) and f(a) must produce logically the same IDiscreteDistribution<B>. Again, if that’s not clearly true in your mind, step through the code mentally or write some sample code and convince yourself that it is true.
  • Two bind operations “on top of each other” must produce the same logical result as a single bind that is the composition of the two bound functions. That’s maybe a little vague; see Part 7 of my series on monads for details. Suffice to say, we meet this condition as well.
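
If you would like to play with the two identity rules without pulling in the whole library, here is a minimal sketch. It is not the code from this series: it models a distribution as a plain Dictionary<T, double> of probabilities, and Unit, Bind, Same and IdentityLawsHold are hypothetical stand-ins I made up for Singleton<T>.Distribution, SelectMany, “logically the same distribution”, and a small usage example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MonadLawSketch
{
    // Stand-in for Singleton<T>.Distribution(t): all the probability on one value.
    public static Dictionary<T, double> Unit<T>(T t) =>
        new Dictionary<T, double> { [t] = 1.0 };

    // Stand-in for SelectMany: each outcome of f(a) is weighted by the probability of a.
    public static Dictionary<R, double> Bind<A, R>(
        Dictionary<A, double> m,
        Func<A, Dictionary<R, double>> f)
    {
        var result = new Dictionary<R, double>();
        foreach (var a in m)
        {
            foreach (var r in f(a.Key))
            {
                // soFar is 0.0 when the key has not been seen yet.
                result.TryGetValue(r.Key, out double soFar);
                result[r.Key] = soFar + a.Value * r.Value;
            }
        }
        return result;
    }

    // "Logically the same": same outcomes, same probabilities (within rounding).
    public static bool Same<T>(Dictionary<T, double> x, Dictionary<T, double> y) =>
        x.Count == y.Count &&
        x.All(p => y.TryGetValue(p.Key, out double q) && Math.Abs(p.Value - q) < 1e-12);

    // Usage: check both identity rules on a small made-up example.
    public static bool IdentityLawsHold()
    {
        var d = new Dictionary<int, double> { [1] = 0.25, [2] = 0.75 };
        Func<int, Dictionary<string, double>> f = n =>
            n == 1
                ? new Dictionary<string, double> { ["a"] = 0.5, ["b"] = 0.5 }
                : new Dictionary<string, double> { ["b"] = 1.0 };

        bool bindUnitIsIdentity = Same(Bind(d, t => Unit(t)), d);
        bool unitThenBindIsApply = Same(Bind(Unit(1), f), f(1));
        return bindUnitIsIdentity && unitThenBindIsApply;
    }
}
```

The dictionary model sidesteps sampling entirely, which makes “same distribution” something we can compare directly; the real types in this series would need Support and Weight to make the same comparison.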

All our conditions are met; IDiscreteDistribution<T> is a monad. So we should be able to use it in a query comprehension, right?

from c in cold
from s in SneezedGivenCold(c)
select s

Unfortunately this gives an error saying that SelectMany cannot be found; what’s up with that?

The query comprehension syntax actually requires a slight variation on the traditional “bind” operation; it requires that we also allow a projection on the end, and that moreover, the projection take both the original value and the transformed value. That is, C# requires us to implement it like this:

public sealed class Combined<A, B, C> : IDiscreteDistribution<C>
{
  private readonly List<C> support;
  private readonly IDiscreteDistribution<A> prior;
  private readonly Func<A, IDiscreteDistribution<B>> likelihood;
  private readonly Func<A, B, C> projection;
  public static IDiscreteDistribution<C> Distribution(
      IDiscreteDistribution<A> prior,
      Func<A, IDiscreteDistribution<B>> likelihood,
      Func<A, B, C> projection) =>
    new Combined<A, B, C>(prior, likelihood, projection);
  private Combined(
    IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood,
    Func<A, B, C> projection)
  {
    this.prior = prior;
    this.likelihood = likelihood;
    this.projection = projection;
    // The support is every distinct projected value reachable from the prior.
    var s = from a in prior.Support()
            from b in this.likelihood(a).Support()
            select projection(a, b);
    this.support = s.Distinct().ToList();
  }
  public IEnumerable<C> Support() =>
    this.support.Select(x => x);
  // Not yet!
  public int Weight(C c) => throw new NotImplementedException();
  public C Sample()
  {
    A a = this.prior.Sample();
    B b = this.likelihood(a).Sample();
    return this.projection(a, b);
  }
}

And now we can implement SelectMany as

public static IDiscreteDistribution<C> SelectMany<A, B, C>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood,
    Func<A, B, C> projection) =>
  Combined<A, B, C>.Distribution(prior, likelihood, projection);

and of course if we want a SelectMany with the traditional monad bind signature, that’s just

public static IDiscreteDistribution<B> SelectMany<A, B>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood) =>
  SelectMany(prior, likelihood, (a, b) => b);
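
The three-argument shape is the one the compiler reaches for when it sees two from clauses followed by a select. A minimal sketch on ordinary sequences rather than distributions shows the correspondence; QueryShapeDemo, ViaQuery and ViaMethod are names I made up for illustration, and the only library method involved is the standard Enumerable.SelectMany overload that takes a result selector.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QueryShapeDemo
{
    // Two "from" clauses followed by a "select"...
    public static List<string> ViaQuery(IEnumerable<int> xs) =>
        (from x in xs
         from y in Enumerable.Range(0, x)
         select x + ":" + y).ToList();

    // ...correspond to a single SelectMany call that takes a collection
    // selector and a projection over both range variables.
    public static List<string> ViaMethod(IEnumerable<int> xs) =>
        xs.SelectMany(
            x => Enumerable.Range(0, x),
            (x, y) => x + ":" + y).ToList();
}
```

Both methods should produce the same list for the same input, which is why supplying a SelectMany of this shape on distributions is enough to make the comprehension compile.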

Now that we have a SelectMany, we can write conditional probabilities in comprehension syntax, as before:

var sneezed = from c in cold
              from s in SneezedGivenCold(c)
              select s;

or, if we like, we can extract a tuple giving us both values:

public static IDiscreteDistribution<(A, B)> Joint<A, B>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood) =>
  SelectMany(prior, likelihood, (a, b) => (a, b));

var joint = cold.Joint(SneezedGivenCold);

and if we graph that, we see that we get the distribution we worked out by hand from last episode:



  (No, No)|****************************************
 (No, Yes)|*
 (Yes, No)|
(Yes, Yes)|***
  (No, No):873
 (No, Yes):27
 (Yes, No):15
(Yes, Yes):85

Aside: Of course I am cheating slightly here because I have not yet implemented the weight function on the combined distribution; we’ll get to that next time!

It might seem slightly vexing that C# requires us to implement a variation on the standard bind operation, but in this case it is actually exactly what we want. Why’s that?

Let’s remind ourselves of how we are notating probability distributions. If we have a collection of possible outcomes of type Cold, we notate that distribution as P(Cold); since Cold has two possibilities, this distribution is made up from two probabilities, P(Cold.Yes) and P(Cold.No) which add up to 100%. We represent this in our type system as IDiscreteDistribution<Cold>.

A conditional probability distribution P(Sneezed|Cold) is “given a value from type Cold, what is the associated distribution P(Sneezed)?” In other words, it is Func<Cold, IDiscreteDistribution<Sneezed>>.

What then is P(Cold&Sneezed)? That is our notation for the joint distribution over all possible pairs. This is made up of four possibilities: P(Cold.No&Sneezed.No), P(Cold.No&Sneezed.Yes), P(Cold.Yes&Sneezed.No), and P(Cold.Yes&Sneezed.Yes), which also add up to 100%.

In our type system, this is IDiscreteDistribution<(Cold, Sneezed)>.

Now, remember the fundamental law of conditional probability is:

P(A) P(B|A) = P(A&B)

That is, the probability of A and B both occurring is the probability of A occurring, multiplied by the probability of B occurring given that A has.

For example, we can pick any values from those types, say:

P(Cold.Yes) P(Sneezed.Yes|Cold.Yes) = P(Cold.Yes&Sneezed.Yes)

That is, the probability of some value of A and some value of B both occurring is the probability of the value of A occurring multiplied by the probability of the value of B given that the value of A has occurred.

Aside: “multiplication” here is assuming that the probabilities are between 0.0 and 1.0, but again, squint a little and you’ll see that it’s all just weights. In the next episode we’ll see how to compute the weights as integers by thinking about how to do the multiplication in fractions.

We’ve implemented P(A) as IDiscreteDistribution<A>, we’ve implemented P(B|A) as Func<A, IDiscreteDistribution<B>>, and P(A&B) as IDiscreteDistribution<(A, B)>.

We have a function Joint<A, B> that takes the first two and gives you the third, and if you work out the math, you’ll see that the probabilities of each member of the joint distribution that results are the products of the probabilities given from the prior and the likelihood. Multiplication of a prior probability by a likelihood across all members of a type is implemented by SelectMany.
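
To make “the products of the probabilities” concrete, here is a rough sketch that multiplies integer weights by hand. It does not use the distribution types from this series: JointWeightsSketch, Joint and Example are hypothetical names, the prior and likelihood are plain weight tables, and the sketch assumes that every likelihood row has the same total weight (100 here) and that all weights are positive. Without that assumption the rows would have to be rescaled before multiplying.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public enum Cold { No, Yes }
public enum Sneezed { No, Yes }

public static class JointWeightsSketch
{
    private static int Gcd(int a, int b) => b == 0 ? a : Gcd(b, a % b);

    // Hypothetical helper: joint weight = prior weight * likelihood weight,
    // then divide everything by the greatest common divisor.
    // Assumes every likelihood row has the same total weight.
    public static Dictionary<(Cold, Sneezed), int> Joint(
        Dictionary<Cold, int> prior,
        Func<Cold, Dictionary<Sneezed, int>> likelihood)
    {
        var products = new Dictionary<(Cold, Sneezed), int>();
        foreach (var a in prior)
        {
            foreach (var b in likelihood(a.Key))
            {
                products[(a.Key, b.Key)] = a.Value * b.Value;
            }
        }
        int gcd = products.Values.Aggregate(Gcd);
        return products.ToDictionary(p => p.Key, p => p.Value / gcd);
    }

    // Usage with the numbers from the cold-and-sneezing example.
    public static Dictionary<(Cold, Sneezed), int> Example()
    {
        var prior = new Dictionary<Cold, int> { [Cold.No] = 90, [Cold.Yes] = 10 };
        Func<Cold, Dictionary<Sneezed, int>> sneezedGivenCold = c =>
            c == Cold.No
                ? new Dictionary<Sneezed, int> { [Sneezed.No] = 97, [Sneezed.Yes] = 3 }
                : new Dictionary<Sneezed, int> { [Sneezed.No] = 15, [Sneezed.Yes] = 85 };
        return Joint(prior, sneezedGivenCold);
    }
}
```

With those inputs the raw products are 8730, 270, 150 and 850, and dividing by their greatest common divisor of 10 gives the same 873, 27, 15 and 85 that appear in the weights listed above.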

Coming up on FAIC: We’ll work out the weights correctly, and that will enable us to build an optimized SelectMany implementation.

Fixing Random, part 12

[Code is here.]

Last time on FAIC we implemented an efficient “conditioned” probability using the Where operator on distributions; that is, we have some “underlying” distribution, and we ask the question “if a particular condition has to be met, what is the derived distribution that meets that condition?” For discrete distributions we can compute that distribution directly and just return it.

There is another kind of conditional probability though, which is much more rich, complex and counter-intuitive, and that is exploring the relationship between “what is the probability of X?” and “what is the probability of Y given that we know X?”

For example: pick a random person in the world who has a cold. What is the probability that they sneezed in the last 24 hours? Probably something like 85%.

Now pick a random person who does not have a cold. For them, the probability is maybe more like 3%. In months when I do not have a cold, I sneeze maybe one or two days.

So what we’ve got here is a rather more complex probability distribution; in fact we have two entirely different distributions, and which one we use depends on a condition.

Notice how this is clearly related to our recent discussion of conditioned probabilities, but different. With a “Where” clause we are saying make the support of this distribution smaller because some outcomes are impossible based on a condition. What we’re talking about here is choosing between two (or more) distributions depending on a condition.

The standard notation for this kind of probability in mathematics is a bit unfortunate. We would say something like P(sneezed|no cold) = 0.03 to represent “3% chance that I sneezed if I didn’t have a cold” and P(sneezed|cold) = 0.85 to represent “85% chance that I sneezed if I had a cold”. That is, P(A|B) means “what is the probability of A given that B happened?”

How might we represent this in our system? It seems like IDiscreteDistribution<T> is not rich enough. Let’s just start making some types and see what we can come up with.

“Has sneezed recently” and “has a cold” are Booleans, but I want the types of everything to be very clear in the analysis which follows, so I’m going to make my own custom types:

enum Cold { No, Yes }
enum Sneezed { No, Yes }

I want to be slightly abusive of notation here and say that P(Cold.Yes) and P(Cold.No) are the weights of a probability distribution that I’m going to call by the shorthand P(Cold). Similarly for P(Sneezed); that’s the probability distribution that gives weights to P(Sneezed.Yes) and P(Sneezed.No). Normally we think of P(something) as being a value between 0.0 and 1.0, but if you squint at it, really those values are just weights. It doesn’t matter what convention we use for weights; a bunch of integers that give ratios of probabilities and a bunch of doubles that give fractions have pretty much the same information content.

Plainly what I would very much like is to have IDiscreteDistribution<Cold> be the C# type that represents P(Cold).

But how can we represent our concept of “There’s a 3% chance I sneezed if I do not have a cold, but an 85% chance if I do have a cold?”

That sure sounds like precisely this:

IDiscreteDistribution<Sneezed> SneezedGivenCold(Cold c)
{
  var list = new List<Sneezed>() { Sneezed.No, Sneezed.Yes };
  return c == Cold.No ?
    list.ToWeighted(97, 3) :
    list.ToWeighted(15, 85);
}

That is, if we do not have a cold then the odds are 97 to 3 that we did not sneeze, and if we do have a cold, then the odds are 15 to 85 that we did not sneeze.

I’ve said that I want to represent P(Cold.Yes) and P(Cold.No) by the shorthand P(Cold), and that plainly this in our type system is IDiscreteDistribution<Cold>. Now I want to represent the notion of P(Sneezed) given a value of Cold as P(Sneezed|Cold), which is implemented by the function above. So, what type in our type system is that? Well, suppose we wanted to assign SneezedGivenCold to a variable; what would its type be? Clearly the type of P(Sneezed|Cold) is Func<Cold, IDiscreteDistribution<Sneezed>>!

How interesting! Conditional probabilities are actually functions.

This sort of function has a name; it is called a likelihood function. That is, given some condition, how likely is some outcome?

So that’s interesting, but how is this useful?

Let’s randomly choose a person in the world, where we do not know whether they have a cold or not. What is the probability that they sneezed recently? It depends entirely on the prevalence of colds! If 100% of the world has a cold, then there’s an 85% chance that a randomly chosen person sneezed recently, but if 0% of the world has a cold, then there’s only a 3% chance. And if it is somewhere in between, the probability will be different from either 85% or 3%.

To solve this problem we need to know the probability that the person we’ve chosen has a cold. The probability that some randomly chosen person has a cold is called the prior probability.

What if 10% of the world has a cold? Let’s work it out by multiplying the probabilities:

 Cold      Sneezed       Result
 (prior)   (likelihood)  (conditional)
 10% Yes   85% Yes       8.5% have a cold, and sneezed
           15% No        1.5% have a cold, did not sneeze
 90% No     3% Yes       2.7% do not have a cold, and sneezed
           97% No       87.3% do not have a cold, did not sneeze

Sure enough those probabilities in the right column add up to 100%. The probability that a randomly chosen person in the world sneezed recently (given that these numbers that I made up are accurate) is 8.5% + 2.7% = 11.2%.
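
Written out as a sum over the two rows of the table that end in a sneeze, the arithmetic behind that 11.2% is the following. This is just the table restated, using the same made-up numbers:

```latex
\begin{align*}
P(\text{Sneezed.Yes})
  &= P(\text{Cold.Yes})\,P(\text{Sneezed.Yes} \mid \text{Cold.Yes})
   + P(\text{Cold.No})\,P(\text{Sneezed.Yes} \mid \text{Cold.No}) \\
  &= 0.10 \times 0.85 + 0.90 \times 0.03 \\
  &= 0.085 + 0.027 = 0.112
\end{align*}
```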

The rightmost column of the table that I’ve sketched out here is called the joint probability, which we’ll notate as P(Cold&Sneezed).

We can write this table more compactly like this:

             Cold Yes    Cold No   Total
Sneezed Yes     8.5%        2.7%   11.2%
Sneezed No      1.5%       87.3%   88.8%
Total            10%         90%    100%

The rightmost column of this table is called the marginal probability, so-called because of the way the sums end up at the margins of the table.

What if we expressed the marginal probability as integers? The odds that a random person has sneezed are 11.2% to 88.8%, which, if you work out the math, is exactly odds of 14 to 111.
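
The reduction to whole-number odds is a matter of clearing the decimal point and then dividing out the greatest common divisor, which here is 8:

```latex
\[
11.2 : 88.8 \;=\; 112 : 888 \;=\; \frac{112}{8} : \frac{888}{8} \;=\; 14 : 111
\]
```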

Aside: when I was debugging the code to compute the weights that we will see in a future episode, I got “111” printed out when I was primed to see “112”, having just computed “11.2%” by hand. I almost went on a lengthy bug hunt looking for the non-existing off-by-one error. Fortunately I stopped and double-checked my work, and realized that the 111 represents the 88.8%, not the 11.2%.

How can we do this math given the set of types we’ve created so far? Let’s start with the prior:

var colds = new List<Cold>() { Cold.No, Cold.Yes };
IDiscreteDistribution<Cold> cold = colds.ToWeighted(90, 10);

We’ve got the prior, and we’ve got the likelihood function SneezedGivenCold. We would like to get the marginal probability IDiscreteDistribution<Sneezed>.

We could implement such a distribution by first sampling from the prior, then calling SneezedGivenCold, and then sampling from the returned distribution. Let’s implement it.

Aside: We are of course assuming that the likelihood function is pure.

public sealed class Combined<A, R> : IDiscreteDistribution<R>
{
  private readonly List<R> support;
  private readonly IDiscreteDistribution<A> prior;
  private readonly Func<A, IDiscreteDistribution<R>> likelihood;
  public static IDiscreteDistribution<R> Distribution(
      IDiscreteDistribution<A> prior,
      Func<A, IDiscreteDistribution<R>> likelihood) =>
    new Combined<A, R>(prior, likelihood);
  private Combined(
    IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<R>> likelihood)
  {
    this.prior = prior;
    this.likelihood = likelihood;
    var q = from a in prior.Support()
            from b in this.likelihood(a).Support()
            select b;
    this.support = q.Distinct().ToList();
  }
  public IEnumerable<R> Support() =>
    this.support.Select(x => x);
  // Sample the prior, then sample the distribution the likelihood returns.
  public R Sample() =>
    this.likelihood(this.prior.Sample()).Sample();
  // We’ll come back to this one.
  public int Weight(R r) => throw new NotImplementedException();
}

We haven’t implemented Weight, but we don’t need it to run a histogram. Let’s try it out:

Combined<Cold, Sneezed>.Distribution(cold, SneezedGivenCold)


Sure enough, it looks like there is about an 11% chance that a randomly chosen person sneezed, given these distributions.

Now, of course as I have done throughout this series, let’s make a little helper function to make the call sites look a little nicer:

public static IDiscreteDistribution<R> MakeCombined<A, R>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<R>> likelihood) => 
  Combined<A, R>.Distribution(prior, likelihood);

Once again, that should look very familiar! I should change the name of this helper.

If you are still surprised at this point, you have not been paying attention. I’ve already made Select and Where, so the obvious next step is…

public static IDiscreteDistribution<R> SelectMany<A, R>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<R>> likelihood) => 
  Combined<A, R>.Distribution(prior, likelihood);

the bind operation on the probability monad.

And the inelegant call site above is now much clearer:

cold.SelectMany(SneezedGivenCold)

Coming up on FAIC: We’ll verify that the distribution type really is a monad, and make a few tweaks to get it working with query comprehension syntax. Then we’ll figure out how to implement the Weight function.