Fixing Random, part 19

I’ve got no code for you this time; instead here are some musings about language design informed by our discussion so far.

One of the most important questions to ask when designing a language feature is: what should the balance be between generality and specificity of a feature?

My go-to example is LINQ, and what we’ve been talking about for the last dozen episodes illustrates the problem nicely.

As we’ve seen many times in this blog, and particularly in this series, query comprehensions allow us to concisely express projection (Select), binding (SelectMany) and use of the zero of an additive monad (Where) to express a condition. Because C# uses a syntactic transformation to implement LINQ, we can express these concepts on an instance of any monad that has the right methods. The compiler does not know or care whether the thing being selected from is a sequence, a probability distribution, whatever, doesn’t matter. All that matters is that a method with the right signature is available as an instance or extension method.

The language designers for C# 3.0 were therefore faced with a choice: how general should the LINQ feature be? There’s a spectrum of possibilities; a few points on that spectrum are:

  • We all know that a monad is just a monoid in the category of endofunctors, which implies that there are other patterns like “monoid” and “category” and “functor” that we could be describing in the type system; what’s so special about monads? We could create a whole “higher-kinded” type system to describe generic type patterns; “monad” is just one possible pattern.
  • We could embed monads as a special kind of thing right into the language, using the vocabulary of monads: bind, unit and so on. In this language, sequences are just a special case of the more general monad feature.
  • The feature could be made specifically about operations on sequences: not just Select, SelectMany and Where which make sense on multiple monads, but also operations that do not apply across monads, like GroupBy, Join and OrderBy. The operations could be applicable to any data source that supports those operations, whether XML documents or database tables or plain old lists, but the concepts in the API are tailored to the domain of data in the form of a sequence of things, but it has to be a sequence.
  • The feature could be narrowly tailored to the needs of a particular data technology; LINQ could have been simply allowing SQL Server query syntax to be embedded directly in the language, and it could have been made to work only with SQL Server and no other dialects of SQL or other database back ends.

The designers of C# chose a point on the spectrum slightly more general than the third point: the feature is not written in the jargon of monads, but it is more general than simple operations on sequences. The designers of Haskell chose maximum generality, and so on.

These are of course not the only options; there were any number of points on that spectrum where the language design could have fallen, and different language designers have chosen different points. Think about list comprehensions in Python, for instance; they are obviously similar to LINQ in many respects, but you’re not going to be marshaling a list comprehension to a database or using it as a generalized monadic bind operation any time soon. There’s nothing wrong with that; it was a reasonable choice for Python.

Our exploration of representing distributions as a monad and combining them using LINQ operators and comprehensions illustrates the pros and cons of C#’s choice. Though I think it is delightful that we can use Select, SelectMany and Where on probability distributions as though they were infinite sequences, and pleased that we can optimize the results using the laws of probability for discrete distributions, the language designer part of me feels like this is somehow wrong. We’re using the vocabulary of sequences, not the vocabulary of probability, and it feels out of place.

This in particular was brought into focus for me in my April 1st bonus episode. Though it was in parts deliberately obtuse and misleading, there was no real joke there; rather, I wished to illustrate a serious point with a silly example. Consider these two queries:

from car in doors
from you in doors
from monty in (
    from m in doors 
    where m != car 
    where m != you 
    select m)
select (car, you, monty)


from car in doors
from you in doors
from monty in doors 
where monty != car 
where monty != you 
select (car, you, monty)

Plainly we can refactor the first into the second if doors is a sequence; this program transformation preserves semantics for sequences. But if doors is a distribution then the refactoring preserves the support but changes the weights.

This is some evidence that query comprehensions are perhaps not the best way to express operations on distributions: our intuitions about what refactorings are correct is heavily influenced by our understanding of sequence semantics, and this could lead to bugs.

Moreover, think about how the design of C# has evolved with respect to monadic types:

  • C# 2 added nullable types and embedded nullable arithmetic and conversions in the language, but did not provide a general mechanism for lifting function calls and other operations to nullables until C# 6 added ?.
  • C# 2 added statement-based sequence workflows in the form of yield return.
  • C# 3 added query comprehensions for composing operations on sequences in an expression context
  • Reactive extensions (Rx) leveraged LINQ to support observables — push-based sequences — but did not add any new syntax; it could do this because the duality between pull- and push-based sequences is strong and our intuitions largely still give correct outcomes.
  • C# 5 added statement-based asynchronous workflows; though you can make async lambdas and use the await operator in the middle of an expression, asynchronous workflows are fundamentally collections of statements, not fluent compositions of expressions like query comprehensions.

You’ll notice that there is no overarching philosophy here; rather, as the needs of real-world developers evolve, the language evolves to represent different concepts at different levels of generality.

Async workflows could have been made to look just like Rx query expressions; after all, a task is logically an observable that pushes a single value instead of multiple values. But the designers of C# figured that it would be more natural for developers to mix asynchrony into their statement-based workflows. Similarly, nullables could be treated as sequences that contain zero or one value, but they’re not.

The lesson of C# here is: when you’re trying to represent a new concept, try creating a domain-specific subset of the language that solves the problem using the vocabulary of the problem, but with enough generality to be extensible in the future.

Could we do the same for stochastic workflows? Using LINQ as our representation has been cute, fun and educational, but as language designers we can recognize that real-world users would likely find this choice confusing and weird; distributions are like sequences in many ways, but they are also dissimilar in a lot of ways.

Can we come up with a domain-specific language subset that better matches the domain of probabilistic programming, but preserves (or improves upon) the nice properties that we’ve seen so far? We’ve seen that we can express a discrete-distribution workflow as a comprehension and automatically get an inferred distribution out the other end; could we do the same thing in a statement-based workflow language, akin to async/await?

Next time on FAIC: I’m going to sketch out a super-stripped down version of C# and add two new statements for probabilistic workflows. We’ll then show how that stripped-down version could be lowered to ordinary C# 7.


Fixing Random, part 18

Before that silly diversion I mentioned that we will be needing the empty distribution; today, we’ll implement it. It’s quite straightforward, as you’d expect. [Code for this episode is here.]

public sealed class Empty<T> : IDiscreteDistribution<T>
  public static readonly Empty<T> Distribution = new Empty<T>();
  private Empty() { }
  public T Sample() =>
    throw new Exception(“Cannot sample from empty distribution”);
  public IEnumerable<T> Support() =>
  public int Weight(T t) => 0;

Easy peasy. Now that we have this, we can fix up our other distributions to use it. The WeightedInteger factory becomes:

public static IDiscreteDistribution<int> Distribution(
  IEnumerable<int> weights)
  List<int> w = weights.ToList();
  if (w.Any(x => x < 0))
    throw new ArgumentException();
  if (!w.Any(x => x > 0))
    return Empty<int>.Distribution;

And the Bernoulli factory becomes:

public static IDiscreteDistribution<int> Distribution(
  int zero, int one)
  if (zero < 0 || one < 0)
    throw new ArgumentException();
  if (zero == 0 && one == 0)
    return Empty<int>.Distribution;

And the StandardDiscreteUniform factory becomes:

public static IDiscreteDistribution<int> Distribution(
  int min, int max)
  if (min > max)
    return Empty<int>.Distribution;

And the Projected factory becomes:

public static IDiscreteDistribution<R> Distribution(
  IDiscreteDistribution<A> underlying, Func<A, R> projection)
  var result = new Projected<A, R>(underlying, projection);
  if (result.weights.Count == 0)
    return Empty<R>.Distribution;

And one more thing needs to change. Our computation in SelectMany assumed that none of the total weights are zero. Easily fixed:

int lcm = prior.Support()
  .Select(a => likelihood(a).TotalWeight())
  .Where(x => x != 0)

We also have a division by total weight; don’t we have to worry about dividing by zero? Nope. Remember, the empty distribution’s support is the empty sequence, so when we then say:

var w = from a in prior.Support()
        let pb = likelihood(a)
        from b in pb.Support()
        group prior.Weight(a) * pb.Weight(b) *
          lcm / pb.TotalWeight()
        by projection(a, b);

If prior.Support() is empty then the whole query is empty and so the division is never executed. If prior.Support()is not empty but one of the pb.Support() is empty then there is nob from which to compute a group key. We never actually divide by total weight, and so there is no division by zero error to avoid.

That was relatively painless, but it is probably still very unclear why we’d ever need an empty distribution. It seems to be like a little bomb hiding in the program, waiting for someone to sample it. Have we just committed another “null reference” design fault? In a few episodes we’ll see what benefits justify the costs.

Next time on FAIC: We’ve been having a lot of fun treating distributions as monads that we can use query comprehensions on, but is that really the best possible syntax?

Fixing Random, bonus episode 1

I just thought of a really cute application of the stochastic workflow technology we’ve been working on; most of the series has already been written but it fits in here, so I’m going to insert this extra bonus episode. We’ll implement the zero value next time.

Code for this bonus episode is here.

You are probably familiar with the famous “Monty Hall” problem, but if not, here it is:


  • You’re on a game show hosted by Monty Hall, the handsome Canadian fellow pictured above.
  • Before you there are three closed doors; behind a uniformly randomly selected door there is a new car; behind the other two there is nothing.
  • You get to choose a door, and you get what is behind it as a prize: either a car, or nothing.
  • You randomly choose a door, again by some uniform process.
  • Monty — who knows where the car is — now always opens a door that meets two criteria: it does not have the car behind it, and it is not the door you chose.
  • To clarify: if you chose the door with the car, Monty chooses one of the remaining two doors by a uniform random choice. If you chose a door without the car, Monty only has one door he can open, and he opens that one.
  • Monty gives you the opportunity to switch your choice to the other still-closed door.
  • Assuming you wish to maximize your probability of winning the car, should you switch doors or stay with your original choice?

Aside: I’ve tried to be very precise in my description of the game for the purposes of our analysis. In the real game as played on television there were irrelevant details such as: the “prizes” behind the other two doors were goats or other bizarre, undesirable items, and so on. But there were also germane differences between the real game and our model above; for example, in the real game Monty would sometimes offer choices like “do you want to switch your door, or forget about the doors entirely and take the prize that is in this box?” and it is not clear by what process Monty decided to offer those choices. In this simplified version of the game I’ve removed all human agency from Monty; for our purposes, Monty is just a machine that is following an algorithm that involves generating random outcomes.

Exercise 1: If you don’t already know the solution, work it out. The answer is below.







You are two-to-one more likely to win the car if you switch than if you stay. But don’t take my word for it. Let’s solve the problem with computers, not by thinking!

Plainly the key to the problem is what is the distribution of Monty’s choice? Monty chooses a random door, but is observed to not pick a door with a car or the door which you picked. We can represent that as a two-parameter likelihood function:

IDiscreteDistribution<int> doors = SDU.Distribution(1, 3);
IDiscreteDistribution<int> Monty(int car, int you) =>
    from m in doors 
    where m != car 
    where m != you 
    select m;

There’s no logical difficulty in adding more parameters to a likelihood function; think of the parameters as a tuple if that makes you feel better.

Now we can answer the question. Here’s the probability distribution of winning if you do not switch:

var noSwitch1 = 
  from car in doors
  from you in doors
  from monty in Monty(car, you)
  select car == you ? “Win” : “Lose”;

And the output is:


As predicted by thinking, you are twice as likely to lose if you do not switch. Computers for the win!

Exercise 2: Wait a minute, we never even used the value of range variable monty in the query. How is it possible that adding a from clause to the query changes its outcome when the sampled value is not even used?!?

Give this some thought and answer in the comments.

Exercise 3: OK smart person, if you thought that one was easy, take a look at this one.

We have our likelihood function Monty() which is just a query comprehension, and our noSwitch1 which is also just a query comprehension. We can make the program a little bit shorter by combining them together in the obvious way:

var noSwitch2 = 
  from car in doors
  from you in doors
  from monty in doors
  where monty != car 
  where monty != you 
  select car == you ? “Win” : “Lose”;

And if we print out the weights of that one… uh oh.


I would have thought this program fragment to be logically the same as before, but this gives weights of 1:1 when we know the correct answer is 1:2.

Where did I go wrong?

Again, answer in the comments.

Next time on FAIC: Let’s get back on track from this silly diversion! We were talking about the zero value, so let’s implement it.

Fixing Random, part 17

Before we get going on today’s episode of FAIC, you might want to refresh your memory of what an additive monad is; I wrote an episode of my monad series on this subject. Briefly, an additive monad is a monad where there is a “zero value”; like the number zero, “multiplying” by zero produces a zero, and “adding” a zero is an identity.

For example, the sequence monad, IEnumerable<T>, has a zero: the empty sequence. If we Select or SelectMany from the empty sequence — our analog of “multiplying” — we get an empty sequence.  If we concatenate an empty sequence onto another sequence — the sequence analog of “adding” — we get the original sequence.

All additive monads can have a Where function defined on them; if we wanted to implement Where for sequences and didn’t care about performance, we could implement it like this:

public static IEnumerable<T> Single<T>(T t)
  yield return t;
public static IEnumerable<T> Zero<T>()
  yield break;
// Non-standard Where:
public static IEnumerable<T> Where<T>(
    this IEnumerable<T> items,
    Func<T, bool> predicate) =>
  from a in items
  from b in predicate(a) ? Single(a) : Zero<T>()
  select b;

That’s slow and produces hideous collection pressure, but it works; our actual implementation of Where is just an optimization.

What about the converse? Our probability monad IDiscreteDistribution<T> has a Where function defined. We definitely have a Singleton<T> type. But our implementation of the distribution monad does not appear to have a zero value. It seems plausible that there should be a way to express Where on distributions as we did with the sequence monad: as a SelectMany that produces either the single or zero distributions based on the predicate.

Give that some thought, and then scroll down when you think you have it worked out.







Just as the zero of the sequence monad is the empty sequence, the zero of the distribution monad is the empty distribution. That is, the distribution with an empty support that throws every time it is sampled.

We never implemented this value because every distribution class we’ve created already throws when you try to create an empty distribution:

  • StandardDiscreteInteger throws if the range is empty.
  • Bernoulli and WeightedInteger both throw if you give them all zero weights.
  • In our current implementation a Where clause where the predicate is false for everything in the support of the underlying collection will eventually throw.
  • In our original implementation, a Where clause where the predicate is always false hangs when sampled, but does not throw.
  • Our implementation of Select throws if the support is empty.
  • And so on.

Exercise: We have learned the following facts:

  • The zero value of the discrete distribution monad is the empty distribution.
  • The joint distribution produced by SelectMany is the analog of multiplication of two distributions.
  • Concatenation is the “addition” of the sequence monad. (The two sequences have to be of the same element type.)

I think it is pretty clear that doing a SelectMany on an empty distribution has to produce an empty distribution. But we still have a mystery to solve: what is the addition operator on two discrete distributions? They have to be of the same element type. The addition operator has to have the property that adding zero to any distribution is an identity, but what does it mean to add together two non-zero distributions?

Answer in the comments with your thoughts.

It turns out that there are some uses for an explicit empty distribution; we’ll discover what the specific benefits of it are in a later episode.

What are the costs? I don’t mean implementation costs, but rather, what are the down sides to developers of having this feature? In short: if we go down this road, what new opportunities for bugs are we producing?

One interesting cost is that we will defer an operation that can throw; this can be very confusing! A classic source of StackOverflow questions is when someone writes an enumerator block:

static IEnumerable<int> Foo(string bar)
  if (bar == null)
    throw new ArgumentNullException();
  yield return bar.Length;

and then calls it:

var foo = Foo(accidentallyNullThing); // no throw
foreach (int x in foo) // throw!

The source of the problem is that the throw is delayed. If you look at the proper, industrial-strength implementations of Where, Select and so on, you’ll notice that each one is written in a style where it validates its arguments first, and then returns a call to a helper method that actually does the iteration. That way the exception is thrown close to the point of the mistake.

However, that doesn’t fix other common variations on the problem. For example, you might have some buggy code that produces an empty sequence sometimes, and then a thousand lines later you call First on the sequence and it blows up, but the bug is where the sequence is produced.

And of course this is really no different than nullable types that blow up when we forget that they can be null; a nullable T is logically a sequence of T where the sequence length is either zero or one, and if we forget that it can be “zero length”, we get into trouble.

The empty distribution will have the same property: it will be easy to create it by accident in a buggy program and it will not blow up until it is sampled, just as nullable reference types do not blow up until they are dereferenced.

That said, we’re going to do it because the benefits are actually pretty compelling, oddly enough.

Next time on FAIC: In the next regularly-scheduled episode we will implement the empty distribution; it’ll be quite straightforward, but it will necessitate fixing up some of our existing code. However, before then I’m going to interrupt this series with a very special episode that addresses a long-standing puzzler in probability theory which I just realized we now have all the gear we need to answer. Stay tuned!


Fixing Random, part 16

[Code is here.]

This series is getting quite long and we’re not done yet! This would be a good time to quickly review where we’re at:

  • We’re representing a particular discrete probability distribution P(A) over a small number of members of a particular type A by IDiscreteDistribution<A>.
  • We can condition a distribution — by discarding certain possibilities from it — with Where.
  • We can project a distribution from one type to another with Select.
  • A conditional probability P(B|A) — the probability of B given that some A is true — is represented as likelihood function of type Func<A, IDiscreteDistribution<B>>.
  • We can “bind” a likelihood function onto a prior distribution with SelectManyto produce a joint distribution.

These are all good results, and I hope you agree that we have already produced a much richer and more powerful abstraction over randomness than System.Random provides. But in today’s episode everything is really going to come together to reveal that we can use these tools to solve interesting problems in probabilistic inference.

To show how, we’ll need to start by reviewing Bayes’ Theorem.

If we have a prior P(A), and a likelihood P(B|A), we know that we can “bind” them together to form the joint distribution. That is, the probability of A and B both happening is the probability of A multiplied by the probability that B happens given that A has happened:

P(A&B) = P(A) P(B|A)

Obviously that goes the other way. If we have P(B) as our prior, and P(A|B) as our likelihood, then:

P(B&A) = P(B) P(A|B)

But A&B is the same as B&A, and things equal to the same are equal to each other. Therefore:

P(A) P(B|A) = P(B) P(A|B)

Let’s suppose that P(A) is our prior and P(B|A) is our likelihood. In the equation above the term P(A|B) is called the posterior and can be computed like this:

P(A|B) = P(A) P(B|A) / P(B)

I hope that is clear, but let’s move away from the abstract mathematics and illustrate an example by using the code we’ve written so far.

We can step back a few episodes and re-examine our prior and likelihood example for Frob Syndrome. Recall that this was a made-up study of a made-up condition which we believe may be linked to height. We’ll use the weights from the original episode.

That is to say: we have P(Height), we have likelihood function P(Severity|Height), and we wish to first compute the joint probability distribution P(Height&Severity):

var heights = new List<Height() { Tall, Medium, Short }
var prior = heights.ToWeighted(5, 2, 1);
IDiscreteDistribution<Severity> likelihood(Height h)
    case Tall: return severity.ToWeighted(10110);
    case Medium: return severity.ToWeighted(0125);
    defaultreturn severity.ToWeighted(001);
var joint = prior.Joint(likelihood);      Console.WriteLine(joint.ShowWeights());

This produces:

(Tall, Severe):850
(Tall, Moderate):935
(Medium, Moderate):504
(Medium, Mild):210
(Short, Mild):357

Now the question is: what is the posterior, P(Height|Severity)? Remember what this is:  it is a function that takes a severity, and returns a distribution of heights.

We can compute the marginal probabilities “by hand” by looking at the weights above:

  • If symptoms are severe, there is a 100% chance that the person is tall.
  • If symptoms are moderate, 935 study members are tall for every 504 medium-height members.
  • If symptoms are mild, then that’s 210 medium people for every 357 short.

We could implement that easily enough; it’s just another function like we’ve seen many times before in this series:

IDiscreteDistribution<Height> posterior(Severity s)
  switch(s) … blah blah blah …

But I don’t want to have a human analyze the data and write the code. We have enough information in the IDiscreteDistribution<(Height, Severity)> to generate a Func<Severity<IDiscreteDistribution>.

In fact, we can simply add another clause to our query:

IDiscreteDistribution<Height> posterior(Severity s) => 
  from pair in joint
  where pair.s == s
  select pair.h;

We can compute the posterior with a Where clause!

Recall that what we are computing here is logically P(A&B)/P(B); just as SelectMany can be thought of as a sort of multiplication, apparently Where is logically a sort of division.

But let’s not stop here; we can make a general rule in the form of an extension method, and I’m going to slap a projection onto the back side of it just for added generality because why not:

public static Func<B, IDiscreteDistribution<C>> Posterior<A, B, C>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood,
    Func<A, B, C> projection) =>
  b => from a in prior
       from bb in likelihood(a)
       where object.Equals(b, bb)
       select projection(a, b);
public static Func<BIDiscreteDistribution<A>> Posterior<AB>(
    this IDiscreteDistribution<A> prior,
    Func<AIDiscreteDistribution<B>> likelihood) =>
Posterior(prior, likelihood, (a, b) => a);

Let’s take it for a spin.

Question: Given the prior distribution and the likelihood function, what is the posterior distribution of height amongst the study members with moderate symptoms?

var posterior = prior.Posterior(likelihood);

And sure enough, we get a probability distribution out that matches what we could have computed by hand:


OK, that’s pretty neat, but why is this relevant?

Because Bayesian inference is incredibly important, and incredibly easy to get wrong! Anything we can do to improve developers’ ability to use Bayesian analysis correctly is a win.

Let’s look at another example. Almost a decade ago I did a blog post where I discussed how Bayesian inference is counterintuitive. Let’s run the numbers from that blog post through our system and see what we get.

We have a disease

enum TappetsDisease { Sick, Healthy }

and our prior is that 99% of the population is healthy:

var prior = new List<TappetsDisease> { Sick, Healthy }
  .ToWeighted(1, 99);

We also have a test:

enum JethroTest { Positive, Negative }

And the test is 99% accurate. That is, if you are sick, it has a 99% chance of “positive”, and if you are healthy, it has a 99% chance of “negative”:

var tests = new List<JethroTest> { Positive, Negative };
<JethroTest> likelihood(TappetsDisease d) =>
  d == Sick ? tests.ToWeighted(99, 1) : tests.ToWeighted(1, 99);

Aside: You might wonder how we know that the test is 99% accurate, and how we know that 1% of the population has the disease, particularly given the counterintuitive result I’m about to discuss. That’s a great question and I’m not going to get into the details in this series of how in the real world medical practitioners evaluate the accuracy of a test or the prevalence of a condition. Let’s just suppose that we know these facts ahead of time; after all, that’s why the prior is called the prior.

Question: you have just tested positive; what is the probability that you have the disease?

Most people, and even many doctors, will say “the test is 99% accurate, you tested positive, therefore there is a 99% chance that you have the disease”. But that is not at all true; we can compute the true result very easily now:

var posterior = prior.Posterior(likelihood);

And we get:


It’s fifty-fifty.


If a result is confusing, always look at the joint distribution:


(Sick, Positive):99
(Sick, Negative):1
(Healthy, Positive):99
(Healthy, Negative):9801

You tested positive. 99 out of every 10000 people are true positives, and 99 out of every 10000 people are false positives. We condition away the negatives, because you didn’t test negative, and what is left? 50% chance that you are positive, not 99%.

Aside: In this example if you test negative then you are not 99% likely to be negative; you are 99.99% likely to be negative! This is also counterintuitive to people.

Exercise: How good does the test have to be for you to have a 90% posterior probability of actually being positive given a positive result?

Bayesian inference is incredibly powerful and useful. We very frequently have good information on priors and likelihoods. We make observations of the world, and we need to figure out posteriors probabilities given those observations. I could list examples all day; a classic example in information technology is:

  • We can ask users to manually classify emails into spam and non-spam. That gives us a prior on P(Spam)
  • From that collection of spam and non-spam emails, we can find out which words are commonly found only in spam. That gives us a likelihood function, P(Words|Spam).
  • We then make an observation of a real email, and the question is: given the words in an email, what is the posterior probability that it is spam? That is, what is the function P(Spam|Words). If the probability passes some threshold, we can put the mail in the spam folder.

There are also real applications in sensor technology:

  • We have a machine in a factory which requires a part on a conveyor to stop moving before it is welded; we manually observe how often the part is stopped correctly, giving us a prior on P(Stopped)
  • We install a sensor that attempts to sense whether the part is stopped, and test its accuracy to obtain P(SensorReading|Stopped)
  • Now we have enough information to compute the posterior: given a certain reading from the sensor, what is the probability that the part has actually stopped moving? That is P(Stopped|SensorReading)
  • If we do not have a high enough probability that the part is actually stopped, we can delay the welding step until we have better evidence that the part has stopped.

There are even applications in developer tools!

  • We can gather information from open source repositories about how often certain functions are called, giving us a prior on P(Function called)
  • We can gather information from IDE keystrokes about how often a letter typed is ultimately the first letter of that function, giving us a likelihood function P(Keystrokes|Function called)
  • Now we have enough information to compute the posterior: given a certain set of recent keystrokes, what is the probability distribution on likely functions the user wishes to call? This could give us much better IntelliSense.

And so on. The opportunities for taking advantage of Bayesian inference are enormous. We really ought to have Bayesian inference on distributions in the basic toolbox of the language, the same way we have ints, doubles, strings, nullables, functions,  tasks, sequences, and so on, in that toolbox.

That’s what I mean by “Fixing Random”. The fundamental problem is not that Random has historically had a candy-machine interface; that’s just a silly historical accident that can be fixed. Rather: we’ve decided that monads like nullable, sequence, function and task are so important that they are included in the core runtime. Why? Not because they’re cool, but because having Nullable<T>, IEnumerable<T>,  Task<T>, and so on in the core runtime makes it much easier for developers to write correct, concise code that solves their problems.

Programming is increasingly about dealing with a world of unknowns; having operators in the language for concisely describing probabilistic workflows seems very valuable to me. This series seeks to make the case for that value.

Next time on FAIC: We’ll take a closer look at the discrete probability distribution type as a monad. We might be missing a concept.

Fixing Random, part 15

[Code is here.]

Last time on FAIC we made a correct, efficient implementation of SelectMany to bind a likelihood function and projection onto a prior, and gave a simple example. I deliberately chose “weird” numbers for all the weights; let’s do that same example again but with more “nice round number” weights:

var prior = new List<Height>() { Tall, Medium, Short }
  .ToWeighted(60, 30, 10);
IDiscreteDistribution<Severity> likelihood(Height h)
    case Tall: return severity.ToWeighted(45550);
    case Medium: return severity.ToWeighted(07030);
    defaultreturn severity.ToWeighted(001);
[… projection as before…]
.SelectMany(likelihood, projection)

This produces the output:


which is correct, but you notice how multiplying the weights during the SelectMany made for some unnecessarily large weights. If we then did another SelectMany on this thing, they’d get even larger, and we’d be getting into integer overflow territory.

Integer overflow is always possible in the system I’ve developed so far in this series, and I am deliberately glossing over this serious problem. A better implementation would either use doubles for weights, which have a much larger range, or arbitrary-precision integers, or arbitrary-precision rationals. I’m using integers to keep it simple, but as with many aspects of the code in this series, that would become problematic in a realistic implementation.

One thing we can do to tame this slightly is to reduce all the weights when possible; plainly in this case we could divide each of them by 10000 and have exactly the same distribution, so let’s do that. And just to make sure, I’m going to mitigate the problem in multiple places:

  • In SelectMany we could be taking the least common multiple (LCM) instead of the full product of the weights.
  • In the WeightedInteger factory we could be dividing out all the weights by their greatest common divisor (GCD).

Long-time readers of my blog may recall that I’ve implemented Euclid’s Algorithm before, but this time I’m going to make a much simpler implementation:

public static int GCD(int a, int b) => 
  b == 0 ? a : GCD(b, a % b);

We define the GCD of two non-negative integers a and b as:

  • if both zero, then zero
  • otherwise, if exactly one is zero, then the non-zero one
  • otherwise, the largest integer that divides both.

Exercise: Prove that this recursive implementation meets the above contract.

The problem we face though is that we have many weights and we wish to find the GCD of all of them. Fortunately, we can simply do an aggregation:

public static int GCD(this IEnumerable<int> numbers) => 

Similarly we can compute the LCM if we know the GCD:

public static int LCM(int a, int b) =>
  a * b / GCD(a, b);
public static int LCM(this IEnumerable<int> numbers) =>
numbers.Aggregate(1, LCM);

And now we can modify our WeightedInteger factory:

public static IDiscreteDistribution<int> Distribution(
  IEnumerable<int> weights)
  List<int> w = weights.ToList();
  int gcd = weights.GCD();
  for (int i = 0; i < w.Count; i += 1)
    w[i] /= gcd;

And our SelectMany:

int lcm = prior.Support()
  .Select(a => likelihood(a).TotalWeight())
[… and then use the lcm in the query …]

See the code repository for all the details. If we apply all these changes then our results look much better…


… and we are at least a little less likely to get into an integer overflow situation.

Aside: Of course we can do the same thing to the Bernoulli class, and normalize its weights as well.

Next time on FAIC: We can use the gear we’ve created so far to solve problems in Bayesian inference; we’ll see how.

So long, MSDN blog

UPDATE 3: Rock stars Scott Hanselman and Dan Fernandez and their colleagues have gotten my MSDN blog back up, and will also restore the late cbrumme’s blog as well. Thank you both, and everyone else at what I can only assume is the Microsoft Content Migration Disaster Mitigation Team for your prompt attention. I very much appreciate it.

I’m still going to migrate all my content over to though. 🙂

UPDATE 2: The awesome Scott Hanselman informs me that there has been a “hiccup” during migration, and that the intention was to archive the MSDN blogs in a read-only format with the same links; they should be back soon.

UPDATE 1: I see this has been linked from HackerNews; welcome, new readers. Normally this blog is not me complaining about Microsoft corporate decision making blunders. I’m currently on part 15 of a series on basic probabilistic programming in C#, so if that interests you, stick around!

For reasons unknown to me, my MSDN blog has been deleted without warning. (Microsoft, I would have appreciated a heads-up. It’s not like you don’t know how to reach me!)

This is unfortunate, since there are literally thousands of links to it spread over the internet that are now dead. And there was a lot of good historical content there. This is very disappointing.

Fortunately I have a backup of all the text, and the graphs and images can be recreated.

I’ve started putting up the old content here, but it will take some time to get it all formatted correctly and whatnot. So far I’ve made it through September 12, 2003, so one day down, many hundreds more to go.

Apparently all the old MSDN blogs are being taken down, which is a great loss. I relied upon old blogs like the late, great cbrumme’s blog to archive the early design decisions for .NET, and there are many others that will be missed.