Fixing Random, part 22

[Code for this episode is here.]

Last time in this series I left you with several challenges for improving our DSL for imperative probabilistic workflows. But first, a puzzle:


Question One: You are walking down the street when you see a can of gasoline, some matches, a house on fire, a wrench, a fire hose and a fire hydrant. You consider the house being on fire to be problematic, so what do you do?

.

.

.

.

Of course you connect the hose to the fire hydrant with the wrench and use the hose to extinguish the house.

Question Two: You are walking down the street when you see a can of gasoline, some matches, a house not on fire, a wrench, a fire hose and a fire hydrant. What do you do?

.

.

.

Set the house on fire with the gasoline and the matches; we have now reduced the problem to a previously solved problem, so we’re done.


That’s how compiler writers approach problems. We know how to do lowerings in a very simplified version of probabilistic C#; to lower more complicated forms, we just keep on applying the “lower it to a previously solved problem” technique:

  • goto is straightforward: you just evaluate the “statement method” that is the statement you’re going to.
  • Once you have goto and if, it is straightforward to implement do, while, break and continue; just reduce those to combinations of if and goto, which are previously solved problems.
  • for loops are just an irritatingly complicated version of while, which is a previously solved problem.
  • Implementing throw and try is tricky because throw is a non-local goto. I might come back in a later episode and describe ways of taming throw and try in these sorts of workflows. For some thoughts on why this is tricky, see my past episodes on problems with iterator blocks. Recall also that use of various kinds of try blocks was originally restricted for async methods, for similar reasons.
  • We have not solved the problem for try, which means that we’re blocked on foreach, lock and using, all of which are syntactic sugars for try.
  • To allow the sample operator in more places, again, reduce the problem to previously solved problems. For example:

x = Foo(sample A(), sample B()[sample C()]) + sample D();

is equivalent to

var a = sample A();
var b = sample B();
var c = sample C();
var f = Foo(a, b[c]);
var d = sample D();
x = f + d;

And we already know how to lower all of those.
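To make the loop reductions concrete, here is a small sketch in ordinary C# (my own illustration, not part of this episode's code) showing a while loop lowered into the if and goto forms that are previously solved problems. Once every test and every statement of the body is its own labeled statement, each label can become one of our "statement methods":

```csharp
using System;

class LoweringDemo
{
    // Original form: sum the integers 1 through n with a while loop.
    public static int SumWhile(int n)
    {
        int total = 0, i = 1;
        while (i <= n)
        {
            total += i;
            i += 1;
        }
        return total;
    }

    // The same loop lowered to the previously solved problems: if and goto.
    // A continue in the body would become "goto S0"; a break, "goto S3".
    public static int SumLowered(int n)
    {
        int total = 0, i = 1;
        S0: if (!(i <= n)) goto S3;
        total += i;
        i += 1;
        goto S0;
        S3: return total;
    }

    static void Main()
    {
        Console.WriteLine(SumWhile(10));
        Console.WriteLine(SumLowered(10));
    }
}
```

Both methods compute the same result; the lowered form is just the loop rewritten entirely in terms of statements we already know how to turn into distribution-returning helper methods.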

  • You’ve got to be careful with the operators that conditionally evaluate their operands (&&, ||, ?? and ?:), but you can lower those into if statements.
  • Assignments, increments and decrements used in expression locations can be handled with the same trick: turn it into a form that has only statements that we already know how to lower.
  • For lambdas and anonymous methods, we’ve already stated up front that they need to be pure, which means that they should not be closed over values which change. Given that restriction, lambdas (and therefore also query expressions) can be allowed.
  • Moreover, we know how to turn lambdas into ordinary methods; if we apply the same restrictions to a lambda body as we do to probabilistic methods then we could allow probabilistic lambdas; that is, lambdas that return probability distributions and contain sample operators and condition statements. This is much the same as the way C# allows async lambdas.
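The short-circuiting case deserves a concrete sketch. Here is my own toy illustration in ordinary C# (A and B are hypothetical placeholder methods, not from this series' code); the point is the shape of the lowering: once each operand evaluation is its own statement, a sample operator could appear at any of those statement positions:

```csharp
using System;

class ConditionalLowering
{
    // Hypothetical placeholder operands; imagine either one containing a sample.
    public static bool A() => true;
    public static bool B() => false;

    // "bool x = A() && B();" lowered so that B() is evaluated only when A()
    // is true, using statements we already know how to handle.
    public static bool Lowered()
    {
        bool t = A();
        if (t)
            t = B();
        bool x = t;
        return x;
    }

    static void Main()
    {
        // The lowered form agrees with the original short-circuiting expression.
        Console.WriteLine(Lowered() == (A() && B())); // True
    }
}
```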

Aside: If actually implementing this scheme sounds super complicated with lots of details to keep track of and performance pitfalls everywhere, congratulations, you now know what it is like to be a compiler developer!

Compiler developers have to do all of this stuff even without fancy kinds of workflows in the language; everything that I am describing is a problem that has to be solved during the codegen phase of a compiler, because IL does not have concepts like while or for or “assignment as an expression”. It’s all about reducing programs to simpler and simpler forms, until you get it down to the IL.


Let’s assume for now that we can make all those basic enhancements to our imperative DSL, and address the question I left you with last time: is this sugar for queries just a sugar, or is there something we can do with it that we can’t do with a query?

Yes and no.

As we’ve seen, ultimately everything gets desugared into calls to SelectMany, so no. Anything you can do with this notation, you can do with a SelectMany, and therefore you could write a query comprehension. However, this notation certainly makes it much easier to clearly and concisely represent operations that would look pretty odd in a query.

For example, we don’t normally think of queries as being recursive, but recursive workflows are straightforward in this DSL, so maybe there is an argument that yes, there is some more expressiveness in the statement form.

How’s that, you say? Let’s look at an example of a recursive workflow.

Let’s play a game in n rounds; we’ll state up front what n is. You flip a coin. If it is heads, I pay you n dollars and we’re done. If it is tails, we go to the next round. In the next round, you flip a coin. If it is heads, I pay you n-1 dollars, if tails, we go to the next round, and so on. If we get down to zero dollars, the game is over, and you get nothing. (The recursion has to stop somewhere!)


Aside: Obviously we do not have to implement this using recursion; we could use a loop. My point is that we easily can implement it with recursion, so recursion is a tool we have available in our toolbox when building probabilistic workflows.

Exercise: what would a fair price be for you to pay me up front so that the game is on average a net gain for neither you nor me? The answer is below.


I’m going to assume that we’ve made all the straightforward improvements to our DSL, including allowing probabilistic expression-bodied methods. We can implement the logic of our game like this:

probabilistic IDiscreteDistribution<int> Game(int rounds) =>
  (rounds <= 0 || sample Flip()) ?
    rounds :
    sample Game(rounds - 1);

We can do an initial lowering to a simpler form:

probabilistic IDiscreteDistribution<int> Game(int rounds)
{
  S0: if (rounds <= 0) goto S5;  
  S1: x = sample Flip();
  S2: if (x) goto S5;  
  S3: y = sample Game(rounds - 1);
  S4: return y;
  S5: return rounds;
}

And then lower that form into actual runnable C# 7. (I’ll elide the unnecessary locals and temporaries.)

static IDiscreteDistribution<int> Game(int rounds)
{
  IDiscreteDistribution<int> S0() =>
    rounds <= 0 ? S5() : S1();
  IDiscreteDistribution<int> S1() =>
    Flip().SelectMany(x => S2(x));
  IDiscreteDistribution<int> S2(bool x) =>
    x ? S5() : S3();
  IDiscreteDistribution<int> S3() =>
    Game(rounds - 1).SelectMany(y => S4(y));
  IDiscreteDistribution<int> S4(int y) =>
    Singleton<int>.Distribution(y);
  IDiscreteDistribution<int> S5() =>
    Singleton<int>.Distribution(rounds);
  return S0();
}

(Inlining that into a query is left as an exercise.)

I asked above what is the fair price to pay; that’s the expected value. We can work that out directly from the distribution:

public static double ExpectedValue(
  this IDiscreteDistribution<int> d) =>
    d.Support().Select(s => 
      (double)s * d.Weight(s)).Sum() / d.TotalWeight();

And if we run it, we get the answers we seek:

Console.WriteLine(Game(5).ShowWeights());
Console.WriteLine(Game(5).ExpectedValue());

5:16
4:8
3:4
2:2
1:1
0:1
4.03125

You, who know probability theory, could have predicted the results: out of a large set of games, on average you get paid 5 half the time, 4 a quarter of the time, and so on. The expected value checks out: (16·5 + 8·4 + 4·3 + 2·2 + 1·1 + 1·0) / 32 = 129/32 = 4.03125.

If you paid me $4.00 for the chance to play this game in five rounds, on average you’d come out ahead by a few cents. Half the time you’d net a dollar, a quarter of the time you’d break even, and the rest of the time you’d lose some or all of your money.

Again the thing that is so exciting about this technique is: we humans didn’t need to work out either the weights or the expected value; we wrote a program describing the game and executing that program produced the solution for the exact distribution of the results!

In a previous series I showed how re-implementing algorithms that use doubles to instead use dual numbers can make the runtime compute the exact value of the derivative of a function at a point while it is computing the value of the function at that point. We no longer have to manually do the work to compute a derivative, and nor do we have to approximate the derivative using numerical methods, and nor do we have to do symbolic differentiation; we just imperatively describe how the value is computed, and the runtime does a little extra work along the way to exactly compute the derivative.

What I’ve shown here is that we can do pretty much the same thing for small discrete probability distributions! We don’t have to work out what the distribution of this silly game is. Nor do we have to simulate the game, collect a hundred thousand data points, and make a statistical approximation of the distribution. We just describe the flow of the game in imperative code, run the method, and the thing that pops out the other end is the exact distribution!  It’s almost magical.


Next time on FAIC: It can’t really be that easy, can it?

Fixing Random, part 21

Last time in this series I proposed a stripped-down DSL for probabilistic workflows. Today, let’s see how we could “lower” it to ordinary C# 7 code. I’ll assume of course that we have all of the types and extension methods that we’ve developed so far. (Code for this episode is here.)

The first thing I’m going to do is describe a possible but very problematic way to do it.

We know how C# lowers methods that have yield return statements: it generates a class that implements IEnumerable, rewrites the method body as the MoveNext method of that class, and makes the method return an instance of the class. We could do the same thing here.

Recall that last time I gave three examples, the longest of which was:

probabilistic static IDiscreteDistribution<string> Workflow(int z)
{
  int ii = sample TwoDSix();
  if (ii == 2)
    return "two";
  condition ii != z;
  bool b = sample Flip();
  return b ? "heads" : ii.ToString();
}

We could lower this to:

static IDiscreteDistribution<string> Workflow(int z) =>
  new WorkflowDistribution(z);

sealed class WorkflowDistribution :
  IDiscreteDistribution<string>
{
  int z;
  public WorkflowDistribution(int z)
  {
    this.z = z;
  }
  public string Sample()
  {
    start:
    int ii = TwoDSix().Sample();
    if (ii == 2) 
      return "two";
    if (!(ii != z)) 
      goto start;
    bool b = Flip().Sample();
    return b ? "heads" : ii.ToString();
  }
  public IEnumerable<string> Support() => ???
  public int Weight(string t) => ???
}

We’d need to make sure that we did the right thing if z got modified during the workflow and the condition was not met of course, but that’s a small detail.

We could similarly lower the other two; I won’t labour the point. It’s a lot of boilerplate code.

The good news is: we get the right distribution out of this thing:

Console.WriteLine(Workflow(3).Histogram());

    4|***
    5|****
    6|*****
    7|*******
    8|******
    9|****
   10|***
   11|**
   12|*
  two|**
heads|****************************************

The bad news is:

  • We haven’t computed the support or the weights, as required.
  • We are once again in a situation where a condition causes a potentially long-running loop to execute; we could do hundreds or thousands of iterations of going back to “start” for every successful sample if the condition was rarely met.

Basically, we’ve lost the great property that we had before: that we automatically get the right discrete distribution object inferred.

So let’s state unequivocally right now what our goal here is. The goal of this feature is to take a method body that describes a probabilistic workflow that includes conditions and produce a semantically equivalent distribution that does not actually contain any loop-causing condition, the same as we did for query-based workflows.

In our exploration of creating a posterior, we’ve seen that we can take a query that contains Where clauses and produce a projected weighted integer distribution that matches the desired distribution. Though there is some cost to producing that distribution object, once we’ve paid that cost there is no additional per-sample cost. We can do the same here, by leveraging the work we’ve already done.

The technique I’m going to use is as follows:

  • Each statement in the method will be numbered starting from zero; statements nested inside if statements are of course themselves statements and get their own numbers.
  • Each statement will be replaced by a local method named Sn, for n the number of the statement.
  • Each local method will take as its arguments all the locals of the method. (If there are any assignments to formal parameters then they will be considered locals for this purpose.)
  • Each local method will return a probability distribution.

At this point I could give you the general rule for each kind of statement in my stripped down language, but let’s not bother. Let’s just lower the methods and it will become obvious what we’re doing.

We’ll start with the easy one.

probabilistic IDiscreteDistribution<bool> Flip()
{
  int x = sample Bernoulli.Distribution(1, 1);
  return x == 0;
}

we rewrite this as:

IDiscreteDistribution<bool> Flip()
{
  IDiscreteDistribution<bool> S0(int x) =>
    Bernoulli.Distribution(1, 1)
      .SelectMany(_x => S1(_x));
  IDiscreteDistribution<bool> S1(int x) => 
    Singleton<bool>.Distribution(x == 0);
  return S0(0);
}

Read that over carefully and convince yourself that this implements the correct logic, just in a much more convoluted fashion.


Aside: Notice that we are taking advantage of the monad laws here regarding the relationship between binding and the function that produces a Singleton. It might not have been clear before why this is an important monad law; hopefully it is now clear.

Aside: Recall that in the jargon of monads, the unit operation that creates a new wrapper around a single instance of the underlying type is sometimes called return. It is no coincidence that our lowering turns return statements into singletons!
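The law in play here can be demonstrated with a toy weighted-distribution type. This is a deliberately tiny stand-in I wrote for illustration, not the series' actual library; it shows the right-identity law that makes the lowering correct: binding a distribution into the Singleton constructor gives back an equivalent distribution, weight for weight:

```csharp
using System;
using System.Collections.Generic;

// A tiny stand-in for the series' IDiscreteDistribution<T>, just for this demo.
class Dist<T>
{
    public readonly Dictionary<T, int> Weights;
    public Dist(Dictionary<T, int> weights) => Weights = weights;

    public static Dist<T> Singleton(T t) =>
        new Dist<T>(new Dictionary<T, int> { [t] = 1 });

    // Bind: multiply the weight of each outcome through the continuation.
    public Dist<R> SelectMany<R>(Func<T, Dist<R>> f)
    {
        var result = new Dictionary<R, int>();
        foreach (var pair in Weights)
            foreach (var inner in f(pair.Key).Weights)
                result[inner.Key] =
                    result.GetValueOrDefault(inner.Key) + pair.Value * inner.Value;
        return new Dist<R>(result);
    }
}

class MonadLawDemo
{
    static void Main()
    {
        var d = new Dist<string>(
            new Dictionary<string, int> { ["heads"] = 1, ["tails"] = 3 });
        // Right identity: d.SelectMany(Singleton) is the same distribution as d.
        var bound = d.SelectMany(Dist<string>.Singleton);
        Console.WriteLine(bound.Weights["heads"]); // 1
        Console.WriteLine(bound.Weights["tails"]); // 3
    }
}
```

Because binding into Singleton is an identity, chaining every statement through SelectMany and ending in a Singleton preserves the distribution the workflow describes.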


Now let’s lower the next one:

probabilistic IDiscreteDistribution<int> TwoDSix()
{
  var d = SDU.Distribution(1, 6);
  int x = sample d;
  int y = sample d;
  return x + y;
}

This becomes this mess:

static IDiscreteDistribution<int> TwoDSix()
{
  IDiscreteDistribution<int> S0(
      IDiscreteDistribution<int> d, int x, int y) =>
    S1(StandardDiscreteUniform.Distribution(1, 6), x, y);
  IDiscreteDistribution<int> S1(
      IDiscreteDistribution<int> d, int x, int y) =>
    d.SelectMany(_x => S2(d, _x, y));
  IDiscreteDistribution<int> S2(
      IDiscreteDistribution<int> d, int x, int y) =>
    d.SelectMany(_y => S3(d, x, _y));
  IDiscreteDistribution<int> S3(
      IDiscreteDistribution<int> d, int x, int y) =>
    Singleton<int>.Distribution(x + y);
  return S0(null, 0, 0);
}

Again, walk through that and convince yourself that it’s doing the right thing. This implementation is ridiculously overcomplicated for what it does, but you should have confidence that it is doing the right thing. Remember, we’re trying to prove that the proposed language feature is in theory possible first, and then we’ll think about ways to make it better.

Finally, let’s do one with an if and a condition:

probabilistic IDiscreteDistribution<string> Workflow(int z)
{
  int ii = sample TwoDSix();
  if (ii == 2)
    return "two";
  condition ii != z;
  bool b = sample Flip();
  return b ? "heads" : ii.ToString();
}

This lowers to:

static IDiscreteDistribution<string> Workflow(int z)
{
  IDiscreteDistribution<string> S0(int ii, bool b) =>
    TwoDSix().SelectMany(_ii => S1(_ii, b));
  IDiscreteDistribution<string> S1(int ii, bool b) =>
    ii == 2 ? S2(ii, b) : S3(ii, b);
  IDiscreteDistribution<string> S2(int ii, bool b) =>
    Singleton<string>.Distribution("two");
  IDiscreteDistribution<string> S3(int ii, bool b) =>
    ii != z ? S4(ii, b) : Empty<string>.Distribution;
  IDiscreteDistribution<string> S4(int ii, bool b) =>
    Flip().SelectMany(_b => S5(ii, _b));
  IDiscreteDistribution<string> S5(int ii, bool b) =>
    Singleton<string>.Distribution(b ? "heads" : ii.ToString());
  return S0(0, false);
}

And at last we can find out what the distribution of this workflow is:

Console.WriteLine(Workflow(3).ShowWeights());

gives us:

  two:2
heads:33
    4:3
    5:4
    6:5
    7:6
    8:5
    9:4
   10:3
   11:2
   12:1

We can mechanically rewrite our little probabilistic workflow language into a program that automatically computes the desired probability distribution.

It should now be pretty clear what the rewrite rules are for our simple DSL:

  • Normal assignment and declaration statements invoke the “next statement” method with new values for the locals.
  • Assignment or declaration statements with sample become SelectMany.
  • The condition statement evaluates its condition and then either continues to the next statement if it is met, or produces an empty distribution if it is not.
  • if statements evaluate their condition and then invoke the consequence or the alternative helper method.
  • return statements evaluate an expression and return a Singleton of it.

Exercise: I severely restricted the kinds of statements and expressions that could appear in this language. Can you see how to implement:

  • while, do, for, break, continue, switch and goto/labeled statements
  • compound assignment, increment and decrement as statements

Exercise: What about assignments, increments and decrements as expressions; any issues there?

Exercise: What about lambdas? Any issues with adding them in? Again, we’re assuming that all functions called in this system are pure, and that includes lambdas.

Exercise: I severely restricted where the sample operator may appear, just to make it easier on me. Can you see how we might ease that restriction? For example, it would be much nicer to write TwoDSix as:

probabilistic IDiscreteDistribution<int> TwoDSix()
{
  var d = SDU.Distribution(1, 6);
  return sample d + sample d;
}

How might you modify the lowering regimen I’ve described to handle that?

We’ll discuss possible answers to all of these in future episodes.


As I’ve noted repeatedly in this series: of course this is not how we’d actually implement the proposed feature. Problems abound:

  • I’ve absurdly over-complicated the code generation to the point where every statement is its own local function.
  • We are passing around all the locals as parameters; probably some of them could be realized as actual locals.
  • Each method is tail recursive, but C# does not guarantee that tail recursions are optimized away, so the stack depth could get quite large.
  • Several of the methods take arguments that they do not ever use.
  • We could have made some simple inlining optimizations that would decrease the number of helper methods considerably.

That last point bears further exploration. Imagine we applied an inlining optimization over and over again — that is, the optimization where we replace a function call with the body of the function. Since every function call here is expression-bodied, that’s pretty easy to do!


Exercise: Do so; the answer follows. (Note that I asked how to write this workflow as a query last time; did you get the same answer?)


.

.

.

.

What would come out the other end of repeated inlining is:

static IDiscreteDistribution<string> Workflow(int z) =>
  TwoDSix().SelectMany(ii => 
    ii == 2 ?
      Singleton<string>.Distribution("two") :
      ii != z ?
        Flip().SelectMany(b => 
      Singleton<string>.Distribution(b ? 
        "heads" : 
            ii.ToString())) : 
        Empty<string>.Distribution);

Which is hardly any easier to read, but at least it is not quite so many functions.

Is this then how we could implement this feature for real? Lower it to functions, then inline the functions? Though it would work, that’s probably not how I’d do it for real. I might get to discuss a more realistic lowering technique in a future episode, but we’ll see how it goes. This series is getting pretty long and we’re not done yet!

And again, my purpose here was to show that it could be done with a straightforward, rules-based transformation; I’m not looking to find the optimal solution. It can be done, which is great!


Aside: I said a few episodes back that it would become clear why having an explicitly empty distribution is a win; hopefully it is now clear. Having an explicitly empty distribution allows us to implement condition using SelectMany! It is not at all obvious how to rewrite the workflow above into a query with a Where clause, but as we’ve seen, implementing the condition statement as conditionally producing an empty distribution gives us the ability to stick what is logically a Where anywhere in the workflow.

Aside: Many episodes back I gave the challenge of implementing a Where that took a probabilistic predicate: Func<T, IDiscreteDistribution<bool>>.  That is no challenge at all in our new workflow language:

bool b = sample predicate(whatever);
condition b;


I sincerely hope that you agree that my proposed “imperative style” workflow is an improvement over our use of LINQ to project, condition and combine probability distributions; our Workflow() method is a lot easier to read as an imperative workflow than its query form, and not just because it uses the jargon of probability rather than that of sequences.


Next time on FAIC: It looks like we’ve come up with a reasonable syntactic sugar, but is it just a sugar? Is there anything that our “imperative” workflow can do that our “LINQ” workflow cannot?

Fixing Random, part 20

Without further ado, here’s my proposed stripped-down C# that could be a DSL for probabilistic workflows; as we’ll see, it is quite similar to both enumerator blocks from C# 2 and async/await from C# 5. (Code for this episode can be found here.)

  • The workflow must be a function with return type of IDiscreteDistribution<T> for some T.
  • Just as methods that use await must be marked async, our DSL methods must be marked probabilistic.
  • The function has an ordinary function body: a block of statements.

Obviously I am using C# 5 asynchronous workflows as an example of how to add a new mechanism to C#. The statements and the expressions in a probabilistic method are restricted as follows:

  • Declaration statements must have an initializer.
  • All the locals must have unique names throughout the method.
  • Assignments can only be in declarations or as statements; no use of assignments in the middle of an expression is allowed.
  • No compound assignments, increments or decrements.
  • No await, yield, try, throw, switch, while, do, break, continue, goto, fixed, lock, using or labeled statements allowed. What’s left? if and return statements are fine, as are blocks { }. I told you I was stripping things down!
  • No lambdas, anonymous functions or query comprehensions.
  • No use of in, out or ref.
  • All the other perfectly normal operations are allowed — function calls, object initializers, member access, indexers, arithmetic, that’s all just fine.
  • However, all function calls and whatnot must be pure; there can be no side effects, and nothing should throw.

Basically what I’m doing here is making a little super-simple subset of C# that doesn’t have any of the weird, hard stuff that keeps compiler writers busy. In this pleasant world locals are always assigned, there are no closures, there are no worries about what happens when an exception is thrown, and all that sort of stuff.

This is pretty draconian, I know. We’ll sketch out how to soften some of those restrictions in later episodes. I want to show that we can describe how to implement a simple core of a language; harder features can come later.

To this little language I’m going to add a new unary operator and a new statement. The new operator is sample, and it may appear only as the right operand of an assignment, or the initializer of a declaration:

int x = sample WeightedInteger.Distribution(8, 2, 7, 3);

The operand must be of type IDiscreteDistribution<T>, and the type of the expression is T. Again, this should feel familiar if you’re used to await.

The new statement is condition, and it has the form

condition expression;

The expression must be convertible to bool. The meaning of this thing is much the same as Where: if the Boolean condition is not met, then a value is filtered out from the distribution by setting its weight to zero. But I’m getting ahead of myself.


Aside: Last time I talked a bit about how language designers must choose how general or specific a language element is, and that this must be reflected in syntax; this is a great example of such a choice.

I’ve chosen condition because I think of this operation as creating a conditional distribution; I could have said where instead of condition and had it mean basically the same thing, but that would be moving towards the established C# jargon for sequences, which I’m explicitly trying to avoid.

However, as we’ve seen, a primary usage case for a conditional distribution is computing the posterior distribution when given a prior and a likelihood. But why do we wish to compute the posterior at all? Typically because we have observed something. That is, the development cycle of the probabilistic program is:

  • Before the program is written, data scientists and developers compute priors, like “What percentage of emails are spam?”
  • They also compute likelihood functions: “what word combinations are likely, given that an email is spam? What word combinations are likely given that an email is not spam?”
  • A spam detection program must now answer the question: given that we have observed an email with certain word combinations, what is the posterior probability that it is spam? This is where the probabilistic workflow comes in.

For that reason, we could reasonably choose observation or observe as our keyword here, instead of condition, emphasizing to the developer “the reason you use this feature is to express the computation of a posterior distribution for a given observation”. That would make the feature feel a little bit less general, but might help more clearly express the desired semantics in the mind of the typical programmer designing a workflow that computes a posterior.

I’m going to stick with condition, but I just wanted to point out that this is a choice that has user experience consequences, were we actually to implement this feature in a language like C#.


Let’s look at some examples:

probabilistic IDiscreteDistribution<bool> Flip()
{
  int x = sample Bernoulli.Distribution(1, 1);
  return x == 0;
}

What I would like is for this method to have the exact same semantics as if I had written:

 IDiscreteDistribution<bool> Flip() => 
   Bernoulli.Distribution(1, 1).Select(x => x == 0);

What do you think about these two implementations? I think the former looks a lot more like the straightforward, imperative code that we’re used to.

Let’s look at another:

probabilistic IDiscreteDistribution<int> TwoDSix()
{
  var d = SDU.Distribution(1, 6);
  int x = sample d;
  int y = sample d;
  return x + y;
}

Again, it seems to me that this is more clear than

IDiscreteDistribution<int> TwoDSix() 
{
  var d = SDU.Distribution(1, 6);
  return from x in d from y in d select x + y;
}

And it certainly seems more clear, particularly to the novice, than

IDiscreteDistribution<int> TwoDSix() 
{
  var d = SDU.Distribution(1, 6);
  return d.SelectMany(x => d, (x, y) => x + y);
}

LINQ is awesome for sequences, but I think the statement-based workflow is much easier to understand for distributions.

Now let’s look at a really complicated workflow, this one with conditioning:

probabilistic IDiscreteDistribution<string> Workflow(int z)
{
  int ii = sample TwoDSix();
  if (ii == 2)
    return "two";
  condition ii != z;
  bool b = sample Flip();
  return b ? "heads" : ii.ToString();
}

The first two were easy to see how they corresponded to query syntax, but what even is the distribution represented by this workflow? It depends on the value of the parameter z, for one thing.

What I want here is: when you call this method, you get a distribution back. When you Sample() that distribution, it should logically have the same effect as: all the sample operations Sample() their operand, and the returned value is, well, the returned value. However, if any condition is false then we abandon the current attempt and start over.

The trick is implementing those semantics without actually running the body in a loop!


Exercise: Pick a value for z; let’s say 3. See if you can work out what the distribution of strings is that should come out the other end of this thing. I’ll give the answer in the next episode.

Exercise: If you had to represent this workflow as a query, how would you do it?


Next time on FAIC: Suppose we were writing a compiler for this feature. We know that LINQ works by turning query comprehensions into function calls; we know that the compiler turns async workflows and iterator blocks into state-machine coroutines. How might we lower this new kind of code into ordinary C# 7 code? Probably somehow using our existing query operators… but what exactly would work?


Fixing Random, part 19

I’ve got no code for you this time; instead here are some musings about language design informed by our discussion so far.

One of the most important questions to ask when designing a language feature is: what should the balance be between generality and specificity of a feature?

My go-to example is LINQ, and what we’ve been talking about for the last dozen episodes illustrates the problem nicely.

As we’ve seen many times in this blog, and particularly in this series, query comprehensions allow us to concisely express projection (Select), binding (SelectMany) and use of the zero of an additive monad (Where) to express a condition. Because C# uses a syntactic transformation to implement LINQ, we can express these concepts on an instance of any monad that has the right methods. The compiler does not know or care whether the thing being selected from is a sequence, a probability distribution, whatever, doesn’t matter. All that matters is that a method with the right signature is available as an instance or extension method.
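That pattern-based translation is easy to see in a tiny example of my own (not from this series' code): give any type, even one that is not remotely a sequence, a Select method of the right shape, and query syntax just works on it:

```csharp
using System;

// A trivial wrapper type; it is not a sequence and implements no interfaces.
struct Boxed<T>
{
    public readonly T Value;
    public Boxed(T value) => Value = value;
}

static class BoxedExtensions
{
    // The compiler only cares that a method named Select with this shape
    // is in scope; it never checks for IEnumerable.
    public static Boxed<R> Select<T, R>(this Boxed<T> b, Func<T, R> f) =>
        new Boxed<R>(f(b.Value));
}

class QueryDemo
{
    static void Main()
    {
        // This query comprehension compiles to new Boxed<int>(21).Select(x => x * 2).
        var q = from x in new Boxed<int>(21) select x * 2;
        Console.WriteLine(q.Value); // 42
    }
}
```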

The language designers for C# 3.0 were therefore faced with a choice: how general should the LINQ feature be? There’s a spectrum of possibilities; a few points on that spectrum are:

  • We all know that a monad is just a monoid in the category of endofunctors, which implies that there are other patterns like “monoid” and “category” and “functor” that we could be describing in the type system; what’s so special about monads? We could create a whole “higher-kinded” type system to describe generic type patterns; “monad” is just one possible pattern.
  • We could embed monads as a special kind of thing right into the language, using the vocabulary of monads: bind, unit and so on. In this language, sequences are just a special case of the more general monad feature.
  • The feature could be made specifically about operations on sequences: not just Select, SelectMany and Where which make sense on multiple monads, but also operations that do not apply across monads, like GroupBy, Join and OrderBy. The operations could be applicable to any data source that supports those operations, whether XML documents or database tables or plain old lists; but the concepts in the API would be tailored to the domain of data in the form of a sequence of things, and it would have to be a sequence.
  • The feature could be narrowly tailored to the needs of a particular data technology; LINQ could have been simply allowing SQL Server query syntax to be embedded directly in the language, and it could have been made to work only with SQL Server and no other dialects of SQL or other database back ends.

The designers of C# chose a point on the spectrum slightly more general than the third point: the feature is not written in the jargon of monads, but it is more general than simple operations on sequences. The designers of Haskell chose maximum generality, and so on.

These are of course not the only options; there were any number of points on that spectrum where the language design could have fallen, and different language designers have chosen different points. Think about list comprehensions in Python, for instance; they are obviously similar to LINQ in many respects, but you’re not going to be marshaling a list comprehension to a database or using it as a generalized monadic bind operation any time soon. There’s nothing wrong with that; it was a reasonable choice for Python.

Our exploration of representing distributions as a monad and combining them using LINQ operators and comprehensions illustrates the pros and cons of C#’s choice. Though I think it is delightful that we can use Select, SelectMany and Where on probability distributions as though they were infinite sequences, and pleased that we can optimize the results using the laws of probability for discrete distributions, the language designer part of me feels like this is somehow wrong. We’re using the vocabulary of sequences, not the vocabulary of probability, and it feels out of place.

This in particular was brought into focus for me in my April 1st bonus episode. Though it was in parts deliberately obtuse and misleading, there was no real joke there; rather, I wished to illustrate a serious point with a silly example. Consider these two queries:

from car in doors
from you in doors
from monty in (
    from m in doors 
    where m != car 
    where m != you 
    select m)
select (car, you, monty)

and

from car in doors
from you in doors
from monty in doors 
where monty != car 
where monty != you 
select (car, you, monty)

Plainly we can refactor the first into the second if doors is a sequence; this program transformation preserves semantics for sequences. But if doors is a distribution then the refactoring preserves the support but changes the weights.

This is some evidence that query comprehensions are perhaps not the best way to express operations on distributions: our intuitions about which refactorings are correct are heavily influenced by our understanding of sequence semantics, and this could lead to bugs.
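To make the divergence concrete, here is a Python sketch (not the C# types from this series) that computes both forms of the query with exact rational arithmetic; the first renormalizes Monty's choice per (car, you) pair, the second applies one global condition:

```python
from fractions import Fraction
from itertools import product

doors = [1, 2, 3]

# First form: monty is drawn from a distribution that is
# renormalized for each (car, you) pair, mirroring the nested query.
win = Fraction(0)
lose = Fraction(0)
for car, you in product(doors, doors):
    montys = [m for m in doors if m != car and m != you]
    for m in montys:
        p = Fraction(1, 9) * Fraction(1, len(montys))
        if car == you:
            win += p
        else:
            lose += p
print(win, lose)  # 1/3 2/3, i.e. weights 1:2

# Second form: one uniform joint over all triples, then a global
# Where; no per-pair renormalization happens, and the ratio changes.
kept = [(c, y, m) for c, y, m in product(doors, doors, doors)
        if m != c and m != y]
wins = sum(1 for c, y, m in kept if c == y)
print(wins, len(kept) - wins)  # 6 6, i.e. weights 1:1
```

The refactoring preserves the support (the same triples remain possible) but shifts the win:lose weights from 1:2 to 1:1.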

Moreover, think about how the design of C# has evolved with respect to monadic types:

  • C# 2 added nullable types and embedded nullable arithmetic and conversions in the language, but did not provide a general mechanism for lifting function calls and other operations to nullables until C# 6 added ?.
  • C# 2 added statement-based sequence workflows in the form of yield return.
  • C# 3 added query comprehensions for composing operations on sequences in an expression context.
  • Reactive extensions (Rx) leveraged LINQ to support observables — push-based sequences — but did not add any new syntax; it could do this because the duality between pull- and push-based sequences is strong and our intuitions largely still give correct outcomes.
  • C# 5 added statement-based asynchronous workflows; though you can make async lambdas and use the await operator in the middle of an expression, asynchronous workflows are fundamentally collections of statements, not fluent compositions of expressions like query comprehensions.

You’ll notice that there is no overarching philosophy here; rather, as the needs of real-world developers evolve, the language evolves to represent different concepts at different levels of generality.

Async workflows could have been made to look just like Rx query expressions; after all, a task is logically an observable that pushes a single value instead of multiple values. But the designers of C# figured that it would be more natural for developers to mix asynchrony into their statement-based workflows. Similarly, nullables could be treated as sequences that contain zero or one value, but they’re not.

The lesson of C# here is: when you’re trying to represent a new concept, try creating a domain-specific subset of the language that solves the problem using the vocabulary of the problem, but with enough generality to be extensible in the future.

Could we do the same for stochastic workflows? Using LINQ as our representation has been cute, fun and educational, but as language designers we can recognize that real-world users would likely find this choice confusing and weird; distributions are like sequences in many ways, but they are also dissimilar in a lot of ways.

Can we come up with a domain-specific language subset that better matches the domain of probabilistic programming, but preserves (or improves upon) the nice properties that we’ve seen so far? We’ve seen that we can express a discrete-distribution workflow as a comprehension and automatically get an inferred distribution out the other end; could we do the same thing in a statement-based workflow language, akin to async/await?


Next time on FAIC: I’m going to sketch out a super-stripped down version of C# and add two new statements for probabilistic workflows. We’ll then show how that stripped-down version could be lowered to ordinary C# 7.

Fixing Random, part 18

Before that silly diversion I mentioned that we will be needing the empty distribution; today, we’ll implement it. It’s quite straightforward, as you’d expect. [Code for this episode is here.]

public sealed class Empty<T> : IDiscreteDistribution<T>
{
  public static readonly Empty<T> Distribution = new Empty<T>();
  private Empty() { }
  public T Sample() =>
    throw new Exception("Cannot sample from empty distribution");
  public IEnumerable<T> Support() =>
    Enumerable.Empty<T>();
  public int Weight(T t) => 0;
}
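For readers following along in another language, here is a rough Python analog of the class above; the names are ad hoc stand-ins, not part of the series' code:

```python
# Sketch of an empty distribution: empty support, weight zero for
# every outcome, and sampling throws.
class EmptyDistribution:
    def sample(self):
        raise Exception("Cannot sample from empty distribution")

    def support(self):
        return []   # the support is the empty sequence

    def weight(self, t):
        return 0    # every outcome has weight zero

empty = EmptyDistribution()
print(empty.support(), empty.weight(42))  # [] 0
```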

Easy peasy. Now that we have this, we can fix up our other distributions to use it. The WeightedInteger factory becomes:

public static IDiscreteDistribution<int> Distribution(
  IEnumerable<int> weights)
{
  List<int> w = weights.ToList();
  if (w.Any(x => x < 0))
    throw new ArgumentException();
  if (!w.Any(x => x > 0))
    return Empty<int>.Distribution;
[…]

And the Bernoulli factory becomes:

public static IDiscreteDistribution<int> Distribution(
  int zero, int one)
{
  if (zero < 0 || one < 0)
    throw new ArgumentException();
  if (zero == 0 && one == 0)
    return Empty<int>.Distribution;
[…]

And the StandardDiscreteUniform factory becomes:

public static IDiscreteDistribution<int> Distribution(
  int min, int max)
{
  if (min > max)
    return Empty<int>.Distribution;
[…]

And the Projected factory becomes:

public static IDiscreteDistribution<R> Distribution(
  IDiscreteDistribution<A> underlying, Func<A, R> projection)
{
  var result = new Projected<A, R>(underlying, projection);
  if (result.weights.Count == 0)
    return Empty<R>.Distribution;
[…]

And one more thing needs to change. Our computation in SelectMany assumed that none of the total weights are zero. Easily fixed:

int lcm = prior.Support()
  .Select(a => likelihood(a).TotalWeight())
  .Where(x => x != 0)
  .LCM();

We also have a division by total weight; don’t we have to worry about dividing by zero? Nope. Remember, the empty distribution’s support is the empty sequence, so when we then say:

var w = from a in prior.Support()
        let pb = likelihood(a)
        from b in pb.Support()
        group prior.Weight(a) * pb.Weight(b) *
          lcm / pb.TotalWeight()
        by projection(a, b);

If prior.Support() is empty then the whole query is empty and so the division is never executed. If prior.Support() is not empty but one of the pb.Support() is empty then there is no b from which to compute a group key. Either way we never divide by a zero total weight, and so there is no division-by-zero error to avoid.

That was relatively painless, but it is probably still very unclear why we’d ever need an empty distribution. It seems to be like a little bomb hiding in the program, waiting for someone to sample it. Have we just committed another “null reference” design fault? In a few episodes we’ll see what benefits justify the costs.


Next time on FAIC: We’ve been having a lot of fun treating distributions as monads that we can use query comprehensions on, but is that really the best possible syntax?

Fixing Random, bonus episode 1

I just thought of a really cute application of the stochastic workflow technology we’ve been working on; most of the series has already been written but it fits in here, so I’m going to insert this extra bonus episode. We’ll implement the zero value next time.

Code for this bonus episode is here.


You are probably familiar with the famous “Monty Hall” problem, but if not, here it is:

[Image: Monty Hall]

  • You’re on a game show hosted by Monty Hall, the handsome Canadian fellow pictured above.
  • Before you there are three closed doors; behind a uniformly randomly selected door there is a new car; behind the other two there is nothing.
  • You get to choose a door, and you get what is behind it as a prize: either a car, or nothing.
  • You randomly choose a door, again by some uniform process.
  • Monty — who knows where the car is — now always opens a door that meets two criteria: it does not have the car behind it, and it is not the door you chose.
  • To clarify: if you chose the door with the car, Monty chooses one of the remaining two doors by a uniform random choice. If you chose a door without the car, Monty only has one door he can open, and he opens that one.
  • Monty gives you the opportunity to switch your choice to the other still-closed door.
  • Assuming you wish to maximize your probability of winning the car, should you switch doors or stay with your original choice?

Aside: I’ve tried to be very precise in my description of the game for the purposes of our analysis. In the real game as played on television there were irrelevant details such as: the “prizes” behind the other two doors were goats or other bizarre, undesirable items, and so on. But there were also germane differences between the real game and our model above; for example, in the real game Monty would sometimes offer choices like “do you want to switch your door, or forget about the doors entirely and take the prize that is in this box?” and it is not clear by what process Monty decided to offer those choices. In this simplified version of the game I’ve removed all human agency from Monty; for our purposes, Monty is just a machine that is following an algorithm that involves generating random outcomes.


Exercise 1: If you don’t already know the solution, work it out. The answer is below.


 

.

.

.

.

.

You are two-to-one more likely to win the car if you switch than if you stay. But don’t take my word for it. Let’s solve the problem with computers, not by thinking!

Plainly the key to the problem is what is the distribution of Monty’s choice? Monty chooses a random door, but is observed to not pick a door with a car or the door which you picked. We can represent that as a two-parameter likelihood function:

IDiscreteDistribution<int> doors = SDU.Distribution(1, 3);
IDiscreteDistribution<int> Monty(int car, int you) =>
    from m in doors 
    where m != car 
    where m != you 
    select m;

There’s no logical difficulty in adding more parameters to a likelihood function; think of the parameters as a tuple if that makes you feel better.

Now we can answer the question. Here’s the probability distribution of winning if you do not switch:

var noSwitch1 = 
  from car in doors
  from you in doors
  from monty in Monty(car, you)
  select car == you ? "Win" : "Lose";
Console.WriteLine(noSwitch1.ShowWeights());

And the output is:

Win:1
Lose:2

As predicted by thinking, you are twice as likely to lose if you do not switch. Computers for the win!


Exercise 2: Wait a minute, we never even used the value of range variable monty in the query. How is it possible that adding a from clause to the query changes its outcome when the sampled value is not even used?!?

Give this some thought and answer in the comments.


Exercise 3: OK smart person, if you thought that one was easy, take a look at this one.

We have our likelihood function Monty() which is just a query comprehension, and our noSwitch1 which is also just a query comprehension. We can make the program a little bit shorter by combining them together in the obvious way:

var noSwitch2 = 
  from car in doors
  from you in doors
  from monty in doors
  where monty != car 
  where monty != you 
  select car == you ? "Win" : "Lose";

And if we print out the weights of that one… uh oh.

Win:1
Lose:1

I would have thought this program fragment to be logically the same as before, but this gives weights of 1:1 when we know the correct answer is 1:2.

Where did I go wrong?

Again, answer in the comments.


Next time on FAIC: Let’s get back on track from this silly diversion! We were talking about the zero value, so let’s implement it.

Fixing Random, part 17

Before we get going on today’s episode of FAIC, you might want to refresh your memory of what an additive monad is; I wrote an episode of my monad series on this subject. Briefly, an additive monad is a monad where there is a “zero value”; like the number zero, “multiplying” by zero produces a zero, and “adding” a zero is an identity.

For example, the sequence monad, IEnumerable<T>, has a zero: the empty sequence. If we Select or SelectMany from the empty sequence — our analog of “multiplying” — we get an empty sequence.  If we concatenate an empty sequence onto another sequence — the sequence analog of “adding” — we get the original sequence.

All additive monads can have a Where function defined on them; if we wanted to implement Where for sequences and didn’t care about performance, we could implement it like this:

public static IEnumerable<T> Single<T>(T t)
{
  yield return t;
}
public static IEnumerable<T> Zero<T>()
{
  yield break;
}
// Non-standard Where:
public static IEnumerable<T> Where<T>(
    this IEnumerable<T> items,
    Func<T, bool> predicate) =>
  from a in items
  from b in predicate(a) ? Single(a) : Zero<T>()
  select b;

That’s slow and produces hideous collection pressure, but it works; our actual implementation of Where is just an optimization.
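The same construction can be sketched for the list monad in a few lines of Python (bind here is just flat-mapping); this illustrates the pattern, not the series' actual code:

```python
def single(t):
    return [t]          # the "unit": a one-element sequence

def zero():
    return []           # the additive monad's zero: the empty sequence

def bind(items, f):
    # SelectMany for lists: apply f to each item and concatenate.
    return [b for a in items for b in f(a)]

def where(items, predicate):
    # Where expressed purely in terms of bind, single and zero.
    return bind(items, lambda a: single(a) if predicate(a) else zero())

print(where([1, 2, 3, 4, 5], lambda x: x % 2 == 0))  # [2, 4]
```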

What about the converse? Our probability monad IDiscreteDistribution<T> has a Where function defined. We definitely have a Singleton<T> type. But our implementation of the distribution monad does not appear to have a zero value. It seems plausible that there should be a way to express Where on distributions as we did with the sequence monad: as a SelectMany that produces either the single or zero distributions based on the predicate.

Give that some thought, and then scroll down when you think you have it worked out.

.

.

.

.

.

.

Just as the zero of the sequence monad is the empty sequence, the zero of the distribution monad is the empty distribution. That is, the distribution with an empty support that throws every time it is sampled.

We never implemented this value because every distribution class we’ve created already throws when you try to create an empty distribution:

  • StandardDiscreteUniform throws if the range is empty.
  • Bernoulli and WeightedInteger both throw if you give them all zero weights.
  • In our current implementation a Where clause where the predicate is false for everything in the support of the underlying distribution will eventually throw.
  • In our original implementation, a Where clause where the predicate is always false hangs when sampled, but does not throw.
  • Our implementation of Select throws if the support is empty.
  • And so on.

Exercise: We have learned the following facts:

  • The zero value of the discrete distribution monad is the empty distribution.
  • The joint distribution produced by SelectMany is the analog of multiplication of two distributions.
  • Concatenation is the “addition” of the sequence monad. (The two sequences have to be of the same element type.)

I think it is pretty clear that doing a SelectMany on an empty distribution has to produce an empty distribution. But we still have a mystery to solve: what is the addition operator on two discrete distributions? They have to be of the same element type. The addition operator has to have the property that adding zero to any distribution is an identity, but what does it mean to add together two non-zero distributions?

Answer in the comments with your thoughts.


It turns out that there are some uses for an explicit empty distribution; we’ll discover what the specific benefits of it are in a later episode.

What are the costs? I don’t mean implementation costs, but rather, what are the down sides to developers of having this feature? In short: if we go down this road, what new opportunities for bugs are we producing?

One interesting cost is that we will defer an operation that can throw; this can be very confusing! A classic source of StackOverflow questions is when someone writes an enumerator block:

static IEnumerable<int> Foo(string bar)
{
  if (bar == null)
    throw new ArgumentNullException();
  yield return bar.Length;
  …
}

and then calls it:

var foo = Foo(accidentallyNullThing); // no throw
[…]
foreach (int x in foo) // throw!

The source of the problem is that the throw is delayed. If you look at the proper, industrial-strength implementations of Where, Select and so on, you’ll notice that each one is written in a style where it validates its arguments first, and then returns a call to a helper method that actually does the iteration. That way the exception is thrown close to the point of the mistake.

However, that doesn’t fix other common variations on the problem. For example, you might have some buggy code that produces an empty sequence sometimes, and then a thousand lines later you call First on the sequence and it blows up, but the bug is where the sequence is produced.

And of course this is really no different than nullable types that blow up when we forget that they can be null; a nullable T is logically a sequence of T where the sequence length is either zero or one, and if we forget that it can be “zero length”, we get into trouble.

The empty distribution will have the same property: it will be easy to create it by accident in a buggy program and it will not blow up until it is sampled, just as nullable reference types do not blow up until they are dereferenced.

That said, we’re going to do it because the benefits are actually pretty compelling, oddly enough.


Next time on FAIC: In the next regularly-scheduled episode we will implement the empty distribution; it’ll be quite straightforward, but it will necessitate fixing up some of our existing code. However, before then I’m going to interrupt this series with a very special episode that addresses a long-standing puzzler in probability theory which I just realized we now have all the gear we need to answer. Stay tuned!

 

Fixing Random, part 16

[Code is here.]


This series is getting quite long and we’re not done yet! This would be a good time to quickly review where we’re at:

  • We’re representing a particular discrete probability distribution P(A) over a small number of members of a particular type A by IDiscreteDistribution<A>.
  • We can condition a distribution — by discarding certain possibilities from it — with Where.
  • We can project a distribution from one type to another with Select.
  • A conditional probability P(B|A) — the probability of B given that some A is true — is represented as a likelihood function of type Func<A, IDiscreteDistribution<B>>.
  • We can “bind” a likelihood function onto a prior distribution with SelectMany to produce a joint distribution.

These are all good results, and I hope you agree that we have already produced a much richer and more powerful abstraction over randomness than System.Random provides. But in today’s episode everything is really going to come together to reveal that we can use these tools to solve interesting problems in probabilistic inference.

To show how, we’ll need to start by reviewing Bayes’ Theorem.

If we have a prior P(A), and a likelihood P(B|A), we know that we can “bind” them together to form the joint distribution. That is, the probability of A and B both happening is the probability of A multiplied by the probability that B happens given that A has happened:

P(A&B) = P(A) P(B|A)

Obviously that goes the other way. If we have P(B) as our prior, and P(A|B) as our likelihood, then:

P(B&A) = P(B) P(A|B)

But A&B is the same as B&A, and things equal to the same are equal to each other. Therefore:

P(A) P(B|A) = P(B) P(A|B)

Let’s suppose that P(A) is our prior and P(B|A) is our likelihood. In the equation above the term P(A|B) is called the posterior and can be computed like this:

P(A|B) = P(A) P(B|A) / P(B)

I hope that is clear, but let’s move away from the abstract mathematics and illustrate an example by using the code we’ve written so far.

We can step back a few episodes and re-examine our prior and likelihood example for Frob Syndrome. Recall that this was a made-up study of a made-up condition which we believe may be linked to height. We’ll use the weights from the original episode.

That is to say: we have P(Height), we have likelihood function P(Severity|Height), and we wish to first compute the joint probability distribution P(Height&Severity):

var heights = new List<Height>() { Tall, Medium, Short };
var prior = heights.ToWeighted(5, 2, 1);
[…]
IDiscreteDistribution<Severity> likelihood(Height h)
{
  switch(h)
  {
    case Tall: return severity.ToWeighted(10, 11, 0);
    case Medium: return severity.ToWeighted(0, 12, 5);
    default: return severity.ToWeighted(0, 0, 1);
  }
}
[…]
var joint = prior.Joint(likelihood);
Console.WriteLine(joint.ShowWeights());

This produces:

(Tall, Severe):850
(Tall, Moderate):935
(Medium, Moderate):504
(Medium, Mild):210
(Short, Mild):357

Now the question is: what is the posterior, P(Height|Severity)? Remember what this is:  it is a function that takes a severity, and returns a distribution of heights.

We can compute the marginal probabilities “by hand” by looking at the weights above:

  • If symptoms are severe, there is a 100% chance that the person is tall.
  • If symptoms are moderate, 935 study members are tall for every 504 medium-height members.
  • If symptoms are mild, then that’s 210 medium people for every 357 short.

We could implement that easily enough; it’s just another function like we’ve seen many times before in this series:

IDiscreteDistribution<Height> posterior(Severity s)
{
  switch(s) … blah blah blah …

But I don’t want to have a human analyze the data and write the code. We have enough information in the IDiscreteDistribution<(Height, Severity)> to generate a Func<Severity, IDiscreteDistribution<Height>>.

In fact, we can simply add another clause to our query:

IDiscreteDistribution<Height> posterior(Severity s) => 
  from pair in joint
  where pair.s == s
  select pair.h;

We can compute the posterior with a Where clause!

Recall that what we are computing here is logically P(A&B)/P(B); just as SelectMany can be thought of as a sort of multiplication, apparently Where is logically a sort of division.

But let’s not stop here; we can make a general rule in the form of an extension method, and I’m going to slap a projection onto the back side of it just for added generality because why not:

public static Func<B, IDiscreteDistribution<C>> Posterior<A, B, C>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood,
    Func<A, B, C> projection) =>
  b => from a in prior
       from bb in likelihood(a)
       where object.Equals(b, bb)
       select projection(a, b);
public static Func<B, IDiscreteDistribution<A>> Posterior<A, B>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood) =>
  Posterior(prior, likelihood, (a, b) => a);

Let’s take it for a spin.

Question: Given the prior distribution and the likelihood function, what is the posterior distribution of height amongst the study members with moderate symptoms?

var posterior = prior.Posterior(likelihood);
Console.WriteLine(posterior(Moderate).ShowWeights());

And sure enough, we get a probability distribution out that matches what we could have computed by hand:

Tall:935
Medium:504
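As a cross-check of the arithmetic, here is a Python sketch that rebuilds the joint weights from the episode's prior and likelihood and then applies the Where-as-division step; the dictionaries are illustrative stand-ins for the C# distributions:

```python
from math import gcd
from functools import reduce

prior = {"Tall": 5, "Medium": 2, "Short": 1}
likelihood = {
    "Tall":   {"Severe": 10, "Moderate": 11, "Mild": 0},
    "Medium": {"Severe": 0,  "Moderate": 12, "Mild": 5},
    "Short":  {"Severe": 0,  "Moderate": 0,  "Mild": 1},
}

def lcm(a, b):
    return a * b // gcd(a, b)

# Scale each likelihood row by lcm / total so all joint weights
# stay integers, as the series' SelectMany does.
totals = {h: sum(w.values()) for h, w in likelihood.items()}
scale = reduce(lcm, totals.values())

joint = {}
for h, ph in prior.items():
    for s, ps in likelihood[h].items():
        w = ph * ps * scale // totals[h]
        if w != 0:
            joint[(h, s)] = w

# The posterior for a given severity is just a Where over the joint.
posterior_moderate = {h: w for (h, s), w in joint.items() if s == "Moderate"}
print(posterior_moderate)  # {'Tall': 935, 'Medium': 504}
```

The joint weights reproduce the table from earlier in the episode, and conditioning on Moderate gives the same 935:504 posterior the C# code prints.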

OK, that’s pretty neat, but why is this relevant?

Because Bayesian inference is incredibly important, and incredibly easy to get wrong! Anything we can do to improve developers’ ability to use Bayesian analysis correctly is a win.

Let’s look at another example. Almost a decade ago I did a blog post where I discussed how Bayesian inference is counterintuitive. Let’s run the numbers from that blog post through our system and see what we get.

We have a disease

enum TappetsDisease { Sick, Healthy }

and our prior is that 99% of the population is healthy:

var prior = new List<TappetsDisease> { Sick, Healthy }
  .ToWeighted(1, 99);

We also have a test:

enum JethroTest { Positive, Negative }

And the test is 99% accurate. That is, if you are sick, it has a 99% chance of “positive”, and if you are healthy, it has a 99% chance of “negative”:

var tests = new List<JethroTest> { Positive, Negative };
IDiscreteDistribution<JethroTest> likelihood(TappetsDisease d) =>
  d == Sick ? tests.ToWeighted(99, 1) : tests.ToWeighted(1, 99);


Aside: You might wonder how we know that the test is 99% accurate, and how we know that 1% of the population has the disease, particularly given the counterintuitive result I’m about to discuss. That’s a great question and I’m not going to get into the details in this series of how in the real world medical practitioners evaluate the accuracy of a test or the prevalence of a condition. Let’s just suppose that we know these facts ahead of time; after all, that’s why the prior is called the prior.


Question: you have just tested positive; what is the probability that you have the disease?

Most people, and even many doctors, will say “the test is 99% accurate, you tested positive, therefore there is a 99% chance that you have the disease”. But that is not at all true; we can compute the true result very easily now:

var posterior = prior.Posterior(likelihood);
Console.WriteLine(posterior(Positive).ShowWeights());

And we get:

Sick:1
Healthy:1

It’s fifty-fifty.

Why?

If a result is confusing, always look at the joint distribution:

Console.WriteLine(prior.Joint(likelihood).ShowWeights());

(Sick, Positive):99
(Sick, Negative):1
(Healthy, Positive):99
(Healthy, Negative):9801

You tested positive. 99 out of every 10000 people are true positives, and 99 out of every 10000 people are false positives. We condition away the negatives, because you did not test negative, and what is left? A 50% chance that you are sick, not 99%.


Aside: In this example if you test negative then you are not 99% likely to be healthy; you are 99.99% likely to be healthy! This is also counterintuitive to people.
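Both posteriors are easy to verify with a quick Python sketch using the same weights as above (the dictionaries are illustrative stand-ins, not the series' API):

```python
prior = {"Sick": 1, "Healthy": 99}
likelihood = {
    "Sick":    {"Positive": 99, "Negative": 1},
    "Healthy": {"Positive": 1,  "Negative": 99},
}

# Both likelihoods have the same total weight (100), so the joint
# weights are plain products of prior and likelihood weights.
joint = {(d, t): prior[d] * likelihood[d][t]
         for d in prior for t in likelihood[d]}

def posterior(test):
    # Condition on the observed test result: a Where over the joint.
    return {d: w for (d, t), w in joint.items() if t == test}

print(posterior("Positive"))  # {'Sick': 99, 'Healthy': 99}
print(posterior("Negative"))  # {'Sick': 1, 'Healthy': 9801}
```

Given a positive result the weights are 99:99, fifty-fifty; given a negative result they are 1:9801, roughly a 99.99% chance of being healthy.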

Exercise: How good does the test have to be for you to have a 90% posterior probability of actually being positive given a positive result?


Bayesian inference is incredibly powerful and useful. We very frequently have good information on priors and likelihoods. We make observations of the world, and we need to figure out posterior probabilities given those observations. I could list examples all day; a classic example in information technology is:

  • We can ask users to manually classify emails into spam and non-spam. That gives us a prior on P(Spam)
  • From that collection of spam and non-spam emails, we can find out which words are commonly found only in spam. That gives us a likelihood function, P(Words|Spam).
  • We then make an observation of a real email, and the question is: given the words in an email, what is the posterior probability that it is spam? That is, what is the function P(Spam|Words). If the probability passes some threshold, we can put the mail in the spam folder.

There are also real applications in sensor technology:

  • We have a machine in a factory which requires a part on a conveyor to stop moving before it is welded; we manually observe how often the part is stopped correctly, giving us a prior on P(Stopped)
  • We install a sensor that attempts to sense whether the part is stopped, and test its accuracy to obtain P(SensorReading|Stopped)
  • Now we have enough information to compute the posterior: given a certain reading from the sensor, what is the probability that the part has actually stopped moving? That is P(Stopped|SensorReading)
  • If we do not have a high enough probability that the part is actually stopped, we can delay the welding step until we have better evidence that the part has stopped.

There are even applications in developer tools!

  • We can gather information from open source repositories about how often certain functions are called, giving us a prior on P(Function called)
  • We can gather information from IDE keystrokes about how often a letter typed is ultimately the first letter of that function, giving us a likelihood function P(Keystrokes|Function called)
  • Now we have enough information to compute the posterior: given a certain set of recent keystrokes, what is the probability distribution on likely functions the user wishes to call? This could give us much better IntelliSense.

And so on. The opportunities for taking advantage of Bayesian inference are enormous. We really ought to have Bayesian inference on distributions in the basic toolbox of the language, the same way we have ints, doubles, strings, nullables, functions,  tasks, sequences, and so on, in that toolbox.

That’s what I mean by “Fixing Random”. The fundamental problem is not that Random has historically had a candy-machine interface; that’s just a silly historical accident that can be fixed. Rather: we’ve decided that monads like nullable, sequence, function and task are so important that they are included in the core runtime. Why? Not because they’re cool, but because having Nullable<T>, IEnumerable<T>,  Task<T>, and so on in the core runtime makes it much easier for developers to write correct, concise code that solves their problems.

Programming is increasingly about dealing with a world of unknowns; having operators in the language for concisely describing probabilistic workflows seems very valuable to me. This series seeks to make the case for that value.


Next time on FAIC: We’ll take a closer look at the discrete probability distribution type as a monad. We might be missing a concept.

Fixing Random, part 15

[Code is here.]


Last time on FAIC we made a correct, efficient implementation of SelectMany to bind a likelihood function and projection onto a prior, and gave a simple example. I deliberately chose “weird” numbers for all the weights; let’s do that same example again but with more “nice round number” weights:

var prior = new List<Height>() { Tall, Medium, Short }
  .ToWeighted(60, 30, 10);
[…]
IDiscreteDistribution<Severity> likelihood(Height h)
{
  switch(h)
  {
    case Tall: return severity.ToWeighted(45, 55, 0);
    case Medium: return severity.ToWeighted(0, 70, 30);
    default: return severity.ToWeighted(0, 0, 1);
  }
}
[… projection as before…]
Console.WriteLine(prior
  .SelectMany(likelihood, projection)
  .ShowWeights());

This produces the output:

DoubleDose:270000
NormalDose:540000
HalfDose:190000

which is correct, but notice how multiplying the weights during the SelectMany made for some unnecessarily large weights. If we then did another SelectMany on this thing, they’d get even larger, and we’d be getting into integer overflow territory.

Integer overflow is always possible in the system I’ve developed so far in this series, and I am deliberately glossing over this serious problem. A better implementation would either use doubles for weights, which have a much larger range, or arbitrary-precision integers, or arbitrary-precision rationals. I’m using integers to keep it simple, but as with many aspects of the code in this series, that would become problematic in a realistic implementation.

One thing we can do to tame this slightly is to reduce all the weights when possible; plainly in this case we could divide each of them by 10000 and have exactly the same distribution, so let’s do that. And just to make sure, I’m going to mitigate the problem in multiple places:

  • In SelectMany we could be taking the least common multiple (LCM) instead of the full product of the weights.
  • In the WeightedInteger factory we could be dividing out all the weights by their greatest common divisor (GCD).

Long-time readers of my blog may recall that I’ve implemented Euclid’s Algorithm before, but this time I’m going to make a much simpler implementation:

public static int GCD(int a, int b) => 
  b == 0 ? a : GCD(b, a % b);

We define the GCD of two non-negative integers a and b as:

  • if both zero, then zero
  • otherwise, if exactly one is zero, then the non-zero one
  • otherwise, the largest integer that divides both.

Exercise: Prove that this recursive implementation meets the above contract.
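A proof is the exercise, but a quick spot-check of the contract’s three cases (here in Python rather than the series’ C#, purely for illustration) can build confidence:

```python
def gcd(a, b):
    # Euclid's algorithm, same shape as the C# above:
    # gcd(a, 0) is a; otherwise recurse on (b, a mod b).
    return a if b == 0 else gcd(b, a % b)

print(gcd(0, 0))    # both zero -> 0
print(gcd(12, 0))   # exactly one zero -> the non-zero one
print(gcd(60, 30))  # 30, the largest integer dividing both
print(gcd(21, 17))  # coprime inputs -> 1
```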


The problem we face though is that we have many weights and we wish to find the GCD of all of them. Fortunately, we can simply do an aggregation:

public static int GCD(this IEnumerable<int> numbers) => 
  numbers.Aggregate(GCD);

Similarly we can compute the LCM if we know the GCD:

public static int LCM(int a, int b) =>
  a * b / GCD(a, b);
public static int LCM(this IEnumerable<int> numbers) =>
  numbers.Aggregate(1, LCM);
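Here’s the same pair of aggregations sketched in Python (gcd_all and lcm_all are my names, not the repo’s), checked against the weights from the example above — dividing 270000, 540000 and 190000 by their GCD of 10000 gives the 27, 54 and 19 we want:

```python
from functools import reduce

def gcd(a, b):
    # Euclid's algorithm
    return a if b == 0 else gcd(b, a % b)

def lcm(a, b):
    # a * b == gcd(a, b) * lcm(a, b)
    return a * b // gcd(a, b)

def gcd_all(numbers):
    # aggregate pairwise GCD over the whole sequence
    return reduce(gcd, numbers)

def lcm_all(numbers):
    # aggregate pairwise LCM, starting from the identity 1
    return reduce(lcm, numbers, 1)

print(gcd_all([270000, 540000, 190000]))  # 10000
print(lcm_all([21, 17, 1]))               # 357
```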

And now we can modify our WeightedInteger factory:

public static IDiscreteDistribution<int> Distribution(
  IEnumerable<int> weights)
{
  List<int> w = weights.ToList();
  int gcd = w.GCD();
  for (int i = 0; i < w.Count; i += 1)
    w[i] /= gcd;
[…]

And our SelectMany:

int lcm = prior.Support()
  .Select(a => likelihood(a).TotalWeight())
  .LCM();
[… and then use the lcm in the query …]

See the code repository for all the details. If we apply all these changes then our results look much better…

DoubleDose:27
NormalDose:54
HalfDose:19

… and we are at least a little less likely to get into an integer overflow situation.
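To see the two mitigations working together, here is a Python sketch (not the repo’s C#; the dictionaries stand in for the distribution objects) that runs the round-number example through an LCM-based SelectMany and then divides out the GCD, as the WeightedInteger factory does:

```python
from functools import reduce

def gcd(a, b):
    return a if b == 0 else gcd(b, a % b)

def lcm(a, b):
    return a * b // gcd(a, b)

# The round-number example from above.
prior = {"Tall": 60, "Medium": 30, "Short": 10}
likelihood = {
    "Tall":   {"Severe": 45, "Moderate": 55},
    "Medium": {"Moderate": 70, "Mild": 30},
    "Short":  {"Mild": 1},
}
projection = {
    ("Tall", "Severe"): "DoubleDose",
    ("Tall", "Moderate"): "NormalDose",
    ("Medium", "Moderate"): "NormalDose",
    ("Medium", "Mild"): "HalfDose",
    ("Short", "Mild"): "HalfDose",
}

totals = {h: sum(d.values()) for h, d in likelihood.items()}
# LCM of the total weights 100, 100 and 1 is 100, not the product 10000.
scale = reduce(lcm, totals.values(), 1)

weights = {}
for h, pw in prior.items():
    for s, sw in likelihood[h].items():
        rx = projection[(h, s)]
        weights[rx] = weights.get(rx, 0) + pw * sw * scale // totals[h]

# Divide out the common factor, as in the factory.
g = reduce(gcd, weights.values())
normalized = {rx: w // g for rx, w in weights.items()}
print(normalized)  # {'DoubleDose': 27, 'NormalDose': 54, 'HalfDose': 19}
```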


Aside: Of course we can do the same thing to the Bernoulli class, and normalize its weights as well.


Next time on FAIC: We can use the gear we’ve created so far to solve problems in Bayesian inference; we’ll see how.

Fixing Random, part 14

[Code is here.]


Last time on FAIC we achieved two major results in our effort to build better probability tools. First, we demonstrated that the SelectMany implementation which applies a likelihood function to a prior probability is the bind operation of the probability monad. Second, we gave an implementation of a wrapper object that implements it. Its action can be summed up as:

  • sample from the prior distribution
  • use the likelihood function to get the conditional distribution
  • sample from the conditional distribution
  • run the projection on the pair of samples to get the result

You probably recall though that I did not implement the Weight function. It’s a little tricky to do so, for two reasons. First, I made the (now somewhat questionable!) decision to make weights integers. If the weights are fractions between 0.0 and 1.0, you can just multiply the weight of the prior sample by the weight of the conditional sample. (And if the weights are logarithms, you can just add them.) It’s trickier with integers. And second, the projection at the end introduces once again the possibility that there will be “collisions”; the projection could map distinct pairs of samples onto the same value, and those collisions then have to be weighted as the sum.

That’s all a little abstract, so let’s work an example.

Suppose we have a population of people who have been diagnosed with Frob Syndrome, which seems to be linked with height. We’ll divide the population of Frob Syndrome patients into three categories:

enum Height { Tall, Medium, Short }

and let’s suppose in our study population there are five tall people, two medium-height people, and one short person in every eight:

var prior = new List<Height>() { Tall, Medium, Short }
  .ToWeighted(5, 2, 1);

(To keep the code short on the page, suppose I have using static directives for each.)

Now let’s suppose we’ve done a survey of each of the tall, medium and short people to learn the severity of their symptoms:

enum Severity { Severe, Moderate, Mild }

At this point I’m going to make the numbers a bit odd to illustrate the mathematics more clearly. How likely is a member of each group to report symptoms of each severity? Let’s say that 10 out of every 21 tall people report severe symptoms, and the remaining 11 report moderate symptoms. For the medium-height people, 12 out of 17 report moderate symptoms and 5 report mild symptoms. And all the short people report mild symptoms:

var severity = new List<Severity>() { Severe, Moderate, Mild };

IDiscreteDistribution<Severity> likelihood(Height h)
{
  switch(h)
  {
    case Tall: return severity.ToWeighted(10, 11, 0);
    case Medium: return severity.ToWeighted(0, 12, 5);
    default: return severity.ToWeighted(0, 0, 1);
  }
}

And now let’s suppose we have a recommended prescription level:

enum Prescription { DoubleDose, NormalDose, HalfDose }

Taller people or people with more severe symptoms get a higher dose; shorter people or people with mild symptoms get a smaller dose:

Prescription projection(Height h, Severity s)
{
  switch (h)
  {
    case Tall: return s == Severe ? DoubleDose : NormalDose;
    case Medium: return s == Mild ? HalfDose : NormalDose;
    default: return HalfDose;
  }
}

The question now is: what is the probability distribution on prescriptions for this study population?  That is, if we picked a random member of this population, how likely is it that they’d have a double, normal or half dose prescription?

IDiscreteDistribution<Prescription> doses =
  prior.SelectMany(likelihood, projection);

The problem is to work out the weightings of the three possible outcomes.

As I mentioned before, it’s easiest to do this when the weights are fractions, because we can just multiply them and then add them up:

Height        Severity           Prescription
Tall   (5/8)  Severe   (10/21)   DoubleDose (25/84)
Tall   (5/8)  Moderate (11/21)   NormalDose (55/168)
Medium (2/8)  Moderate (12/17)   NormalDose  (3/17)
Medium (2/8)  Mild      (5/17)   HalfDose    (5/68)
Short  (1/8)  Mild      (1/1)    HalfDose    (1/8)

(To save space I’ve elided the zero rows.)

So the probability of a member of this population getting a double dose is 25/84, getting a normal dose is 55/168 + 3/17 = 1439/2856, and getting a half dose is 5/68 + 1/8 = 27/136. Verifying that those add up to 1.0 is left as an exercise.
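(If you’d rather not do that exercise by hand, exact rational arithmetic makes short work of it; here is a quick check in Python, which is not the series’ language but has a convenient fractions module:)

```python
from fractions import Fraction as F

# Weights from the table: prior times conditional, summed by prescription.
double = F(5, 8) * F(10, 21)                        # Tall/Severe
normal = F(5, 8) * F(11, 21) + F(2, 8) * F(12, 17)  # Tall/Moderate + Medium/Moderate
half   = F(2, 8) * F(5, 17) + F(1, 8) * F(1, 1)     # Medium/Mild + Short/Mild

print(double)                  # 25/84
print(normal)                  # 1439/2856
print(half)                    # 27/136
print(double + normal + half)  # 1
```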

But we’re going to grit our teeth here and do it all in integers! How might we do that?

Well, we know how to eliminate fractions: multiply all the weights in the first column by 8, and all the weights in the second column by 21 * 17, and none of the proportions will change:

Height      Severity         Prescription
Tall   (5)  Severe   (170)   DoubleDose (850)
Tall   (5)  Moderate (187)   NormalDose (935)
Medium (2)  Moderate (252)   NormalDose (504)
Medium (2)  Mild     (105)   HalfDose   (210)
Short  (1)  Mild     (357)   HalfDose   (357)

So the integer weights are: double dose is 850, normal dose is 935 + 504 = 1439, and half dose is 210 + 357 = 567.
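We can mechanically double-check that table before implementing; a few lines of Python (a sketch of the arithmetic only, not the C# we’re about to write):

```python
# Multiplying the prior fractions by 8 gives the integer prior weights
# 5, 2, 1 directly; 21 * 17 = 357 clears both likelihood denominators.
scale = 21 * 17
rows = [
    # (prior weight, likelihood numerator, likelihood denominator, prescription)
    (5, 10, 21, "DoubleDose"),  # Tall, Severe
    (5, 11, 21, "NormalDose"),  # Tall, Moderate
    (2, 12, 17, "NormalDose"),  # Medium, Moderate
    (2,  5, 17, "HalfDose"),    # Medium, Mild
    (1,  1,  1, "HalfDose"),    # Short, Mild
]

weights = {}
for pw, num, den, rx in rows:
    # prior weight times scaled conditional weight, grouped by prescription
    weights[rx] = weights.get(rx, 0) + pw * num * scale // den

print(weights)  # {'DoubleDose': 850, 'NormalDose': 1439, 'HalfDose': 567}
```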

Let’s implement it!

First off, oddly enough there is a Sum() extension method but no Product() extension method, so let’s implement that:

public static int Product(this IEnumerable<int> items) =>
  items.Aggregate(1, (a, b) => a * b);
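The equivalent fold in Python, for comparison, applied to the total weights of our three likelihood distributions (21, 17 and 1):

```python
from functools import reduce

def product(items):
    # fold multiplication over the sequence; 1 is the multiplicative identity
    return reduce(lambda a, b: a * b, items, 1)

print(product([21, 17, 1]))  # 357
print(product([]))           # 1 -- the empty product is the identity
```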

And I also need to know the total weight of a distribution:

public static int TotalWeight<T>(
    this IDiscreteDistribution<T> d) =>
  d.Support().Select(t => d.Weight(t)).Sum();

And now we can implement the algorithm I just sketched out:

int product = prior.Support()
  .Select(a => likelihood(a).TotalWeight())
  .Product();
var w = from h in prior.Support()
        let ps = likelihood(h)
        from s in ps.Support()
        group prior.Weight(h) * ps.Weight(s) *
              product / ps.TotalWeight()
        by projection(h, s);
var dict = w.ToDictionary(g => g.Key, g => g.Sum());
var doses = dict.Keys.ToList();
var weights = dict.Values.ToList();

And sure enough, if we print those last two out:

DoubleDose, NormalDose, HalfDose
850, 1439, 567

Super, we can now work out the weights in our implementation of SelectMany.

But… wait a minute. Why do we have to?

That is, why do we need a Combined wrapper class for SelectMany at all?

We just worked out the weights of every member of the support, and we did so making no assumptions whatsoever about the prior or the likelihood function. We can delete our Combined wrapper class, and replace our implementation of SelectMany with:

public static IDiscreteDistribution<C> SelectMany<A, B, C>(
  this IDiscreteDistribution<A> prior,
  Func<A, IDiscreteDistribution<B>> likelihood,
  Func<A, B, C> projection)
{
  int product = prior.Support()
    .Select(a => likelihood(a).TotalWeight())
    .Product();
  var w = from a in prior.Support()
          let pb = likelihood(a)
          from b in pb.Support()
          group prior.Weight(a) * pb.Weight(b) *
            product / pb.TotalWeight()
          by projection(a, b);
  var dict = w.ToDictionary(g => g.Key, g => g.Sum());
  return dict.Keys.ToWeighted(dict.Values);
}


Exercise: Do you see any potential pitfalls in this implementation of computing the new weights? Give it some thought; I’ll give the answer in the next episode.


We do a small amount of math up front, and in exchange, we have computed the exact resulting probability distribution, which we can sample from efficiently, just as we did with Where and Select in previous episodes.


Aside: Once again, if you trace through all the logic I’ve written so far you will quickly see that it is hugely inefficient in terms of the amount of re-computation it does and garbage it produces. If we were writing this for production code, we’d be a lot more aggressive about finding code paths that do re-computation and eliminating them. The point of this exercise is that our code produces correct, efficient distribution objects out the other end, even if it is a bit wasteful to do so in this particular pedagogic implementation.


Think about the power of this: you can write programs that treat discrete probability distributions over arbitrary types as values, the same way you’d treat integers, strings, sequences, or whatever, as values that obey a particular set of algebraic rules. We can project, condition and combine them together with the same ease that we do today with sequences, and sample from them to boot!

The idea that we can describe a probabilistic workflow, and have as the output of that workflow a new distribution semantically equivalent to the effect of the workflow, but without any long-running sample-and-reject loops due to the conditions, is called inference by the probabilistic programming community.

We’ve seen that we can do inference on arbitrary discrete distributions provided that the supports are small and the weights are small integers; as we’ll see throughout the rest of this series, the problem gets considerably harder as we abandon some of those simplifying assumptions.


Next time on FAIC: I’m going to implement a minor optimization in our weighted integer distribution. After that, we’ll put it all together to show how what we’ve developed so far can be used for Bayesian inference.