Porting old posts, part 1

Thanks again to the good people at Microsoft who have kept my old blog alive for now; my plan is to port the articles from the old site over, and then have the old URLs redirect to the new ones.

I’ve started the long process of porting old articles and it has been fun revisiting topics I haven’t thought about much for years.

I started this blog almost two years after I stopped working on scripting at Microsoft, but for the first several years, that’s almost all I posted about. The purpose of my blog was to get onto the web all the technical and design decisions we’d made about VBScript and JScript, and a lot of the details of my day job on Visual Studio Tools for Office did not translate well into the blog format.

IE is of course long gone, and now that Microsoft has switched to a Chromium-based browser for Edge, all of this stuff is now “for historical purposes only”; surely there are no developers left who need this information. Though that does make me slightly sad, I have to remember that software is plastic and we’re in the business of doing better than our past selves; sometimes that means starting afresh.

Anyways, as I port articles over I’ll post links to them here, with a few reflections.

Eric’s Complete Guide to BSTR Semantics
Eric’s Complete Guide to VT_DATE

My first posts, among the most popular of all time, and surprisingly still relevant.

Getting numerous links over the years from Raymond and others was a big help in kick-starting this blog. It’s also interesting in retrospect to realize that the standard format for allocating and deallocating strings in COM came from Visual Basic of all places; the B stands for “Basic”.

The VT_DATE format is horrid and should be destroyed by fire, but I think we’re still stuck with it.

What’s up with Hungarian Notation?
Bad Hungarian

The naming convention that everyone loves to hate is actually very useful if you use it correctly.

Why does JavaScript have rounding errors?

A classic FAQ that I got all the time when supporting the script engines.

As I update these articles, I’m going to try to use “JavaScript” to refer to the language proper, and “JScript” when I am speaking specifically about the Microsoft implementation from the mid 1990s.

Cannot use parentheses when calling a Sub
More on ByVal and ByRef
What are the VBScript semantics for object members?
Are JavaScript strings passed by value or by reference?
Are JavaScript strings passed by reference?

What does this error mean? It means that you cannot use parentheses when calling a sub! But boy, what a poor choice of syntax; it created so many questions.

The reason for the error is an oddity in the way that VBScript distinguishes between passing by value and passing by reference.

Understanding the difference between value semantics and reference semantics as they pertain to values, references and variables is a constant source of user questions. I wish we had done a better job early on of using “reference” to mean a reference to an object, and “alias” to mean a reference to a variable. But, we didn’t, and now we’re stuck with lots of beginner confusion.

That’s why this was a frequent topic in the early days of my blog, and came back again in the C# years.

This also marks the first of many quizzes and puzzles, to be answered in future episodes.

Smart pointers are too smart

The first really “controversial” post where I took a stand and wrote a rant. I stand by my opinion.

What are anonymous functions in JavaScript?
What are closures?

Functions without names, obviously. This is my first foray into writing about functional programming.

Why do the script engines not cache dispatch identifiers?
How do the script garbage collectors work?

These were the first posts where I really got into the implementation choices of a programming language.

This was also the first of many times I was told I was WRONG WRONG WRONG by Brendan Eich, the designer of JavaScript, later to be the CEO of Mozilla who stepped down after supporting anti-equality initiatives in California.

What are threading models?
How does ASP use the script engines?
Why is it a bad idea to put script objects in session scope?
Learnin’ about threadin’ the harrrrd way

The script engines use a wacky threading model, but it was built like that for reasons. However, if you use it wrong in ASP, you’ll quickly wreck the performance of your server as it tries to follow all the rules.

And if the developer of the components makes a boneheaded mistake, as I surely did, he learns all this stuff in a hurry.

Hard core denotational semantics

I discuss the highfalutin specification techniques used by the maintainers of the ECMAScript 4 specification; unfortunately, that version of the spec famously never shipped, and ECMAScript 5 went back to operational semantics.


Fixing Random, part 20

Without further ado, here’s my proposed stripped-down C# that could be a DSL for probabilistic workflows; as we’ll see, it is quite similar to both enumerator blocks from C# 2 and async/await from C# 5. (Code for this episode can be found here.)

  • The workflow must be a function with return type of IDiscreteDistribution<T> for some T.
  • Just as methods that use await must be marked async, our DSL methods must be marked probabilistic.
  • The function has an ordinary function body: a block of statements.

Obviously I am using C# 5 asynchronous workflows as an example of how to add a new mechanism to C#. The statements and the expressions in a probabilistic method are restricted as follows:

  • Declaration statements must have an initializer.
  • All the locals must have unique names throughout the method.
  • Assignments can only be in declarations or as statements; no use of assignments in the middle of an expression is allowed.
  • No compound assignments, increments or decrements.
  • No await, yield, try, throw, switch, while, do, break, continue, goto, fixed, lock, using or labeled statements allowed. What’s left? if and return statements are fine, as are blocks { }. I told you I was stripping things down!
  • No lambdas, anonymous functions or query comprehensions.
  • No use of in, out or ref.
  • All the other perfectly normal operations are allowed — function calls, object initializers, member access, indexers, arithmetic, that’s all just fine.
  • However, all function calls and whatnot must be pure; there can be no side effects, and nothing should throw.

Basically what I’m doing here is making a little super-simple subset of C# that doesn’t have any of the weird, hard stuff that keeps compiler writers busy. In this pleasant world locals are always assigned, there are no closures, there are no worries about what happens when an exception is thrown, and all that sort of stuff.

This is pretty draconian, I know. We’ll sketch out how to soften some of those restrictions in later episodes. I want to show that we can describe how to implement a simple core of a language; harder features can come later.

To this little language I’m going to add a new unary operator and a new statement. The new operator is sample, and it may appear only as the right operand of an assignment, or the initializer of a declaration:

int x = sample WeightedInteger.Distribution(8, 2, 7, 3);

The operand must be of type IDiscreteDistribution<T>, and the type of the expression is T. Again, this should feel familiar if you’re used to await.

The new statement is condition, and it has the form

condition expression;

The expression must be convertible to bool. The meaning of this thing is much the same as Where: if the Boolean condition is not met, then a value is filtered out from the distribution by setting its weight to zero. But I’m getting ahead of myself.
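
To make that concrete before we move on, here is a quick sketch of a tiny workflow using condition, together with the query I would expect it to be equivalent to; take this as an illustration of the intended semantics rather than the actual lowering, which comes later.

probabilistic IDiscreteDistribution<int> EvenRoll()
{
  int x = sample SDU.Distribution(1, 6);
  condition x % 2 == 0;  // odd outcomes get weight zero
  return x;
}

// The intended equivalent using the operators we already have:
IDiscreteDistribution<int> EvenRoll2() =>
  SDU.Distribution(1, 6).Where(x => x % 2 == 0);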


Aside: Last time I talked a bit about how language designers must choose how general or specific a language element is, and that this must be reflected in syntax; this is a great example of such a choice.

I’ve chosen condition because I think of this operation as creating a conditional distribution; I could have said where instead of condition and had it mean basically the same thing, but that would be moving towards the established C# jargon for sequences, which I’m explicitly trying to avoid.

However, as we’ve seen, a primary use case for a conditional distribution is computing the posterior distribution when given a prior and a likelihood. But why do we wish to compute the posterior at all? Typically because we have observed something. That is, the development cycle of the probabilistic program is:

  • Before the program is written, data scientists and developers compute priors, like “What percentage of emails are spam?”
  • They also compute likelihood functions: “what word combinations are likely, given that an email is spam? What word combinations are likely given that an email is not spam?”
  • A spam detection program must now answer the question: given that we have observed an email with certain word combinations, what is the posterior probability that it is spam? This is where the probabilistic workflow comes in.

For that reason, we could reasonably choose observation or observe as our keyword here, instead of condition, emphasizing to the developer “the reason you use this feature is to express the computation of a posterior distribution for a given observation”. That would make the feature feel a little bit less general, but might help more clearly express the desired semantics in the mind of the typical programmer designing a workflow that computes a posterior.

I’m going to stick with condition, but I just wanted to point out that this is a choice that has user experience consequences, were we actually to implement this feature in a language like C#.


Let’s look at some examples:

probabilistic IDiscreteDistribution<bool> Flip()
{
  int x = sample Bernoulli.Distribution(1, 1);
  return x == 0;
}

What I would like is for this method to have the exact same semantics as if I had written:

 IDiscreteDistribution<bool> Flip() => 
   Bernoulli.Distribution(1, 1).Select(x => x == 0);

What do you think about these two implementations? I think the former looks a lot more like the straightforward, imperative code that we’re used to.

Let’s look at another:

probabilistic IDiscreteDistribution<int> TwoDSix()
{
  var d = SDU.Distribution(1, 6);
  int x = sample d;
  int y = sample d;
  return x + y;
}

Again, it seems to me that this is more clear than

IDiscreteDistribution<int> TwoDSix() 
{
  var d = SDU.Distribution(1, 6);
  return from x in d from y in d select x + y;
}

And it certainly seems more clear, particularly to the novice, than

IDiscreteDistribution<int> TwoDSix() 
{
  var d = SDU.Distribution(1, 6);
  return d.SelectMany(x => d, (x, y) => x + y);
}

LINQ is awesome for sequences, but I think the statement-based workflow is much easier to understand for distributions.

Now let’s look at a really complicated workflow, this one with conditioning:

probabilistic IDiscreteDistribution<string> Workflow(int z)
{
  int ii = sample TwoDSix();
  if (ii == 2)
    return "two";
  condition ii != z;
  bool b = sample Flip();
  return b ? "heads" : ii.ToString();
}

It was easy to see how the first two corresponded to query syntax, but what even is the distribution represented by this workflow? It depends on the value of the parameter z, for one thing.

What I want here is: when you call this method, you get a distribution back. When you Sample() that distribution, it should logically have the same effect as: all the sample operations Sample() their operand, and the returned value is, well, the returned value. However, if any condition is false then we abandon the current attempt and start over.

The trick is implementing those semantics without actually running the body in a loop!
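
To be explicit about what those sampling semantics mean, here is a sketch of the naive rejection-sampling interpretation of the workflow above: run the body, and restart whenever a condition fails. This is the behavior we want to reproduce, not the implementation we will actually use.

string SampleWorkflow(int z)
{
  while (true)
  {
    int ii = TwoDSix().Sample();
    if (ii == 2)
      return "two";
    if (!(ii != z))
      continue; // the condition failed; abandon this attempt and start over
    bool b = Flip().Sample();
    return b ? "heads" : ii.ToString();
  }
}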


Exercise: Pick a value for z; let’s say 3. See if you can work out what the distribution of strings is that should come out the other end of this thing. I’ll give the answer in the next episode.

Exercise: If you had to represent this workflow as a query, how would you do it?


Next time on FAIC: Suppose we were writing a compiler for this feature. We know that LINQ works by turning query comprehensions into function calls; we know that the compiler turns async workflows and iterator blocks into state-machine coroutines. How might we lower this new kind of code into ordinary C# 7 code? Probably somehow using our existing query operators… but what exactly would work?


Fixing Random, part 19

I’ve got no code for you this time; instead here are some musings about language design informed by our discussion so far.

One of the most important questions to ask when designing a language feature is: what should the balance be between generality and specificity of a feature?

My go-to example is LINQ, and what we’ve been talking about for the last dozen episodes illustrates the problem nicely.

As we’ve seen many times in this blog, and particularly in this series, query comprehensions allow us to concisely express projection (Select), binding (SelectMany) and use of the zero of an additive monad (Where) to express a condition. Because C# uses a syntactic transformation to implement LINQ, we can express these concepts on an instance of any monad that has the right methods. The compiler does not know or care whether the thing being selected from is a sequence, a probability distribution, whatever, doesn’t matter. All that matters is that a method with the right signature is available as an instance or extension method.
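
To illustrate just how pattern-based the transformation is, here is a minimal sketch of a type that is neither a sequence nor a distribution but still supports query syntax; the type and its members are invented for this example.

// A trivial "box" holding exactly one value. It is not a sequence and not a
// distribution, but it has a SelectMany with the right shape, so the compiler
// will happily translate query comprehensions into calls on it.
public sealed class Box<T>
{
  public T Value { get; }
  public Box(T value) => Value = value;
  public Box<R> Select<R>(Func<T, R> f) => new Box<R>(f(Value));
  public Box<R> SelectMany<U, R>(Func<T, Box<U>> f, Func<T, U, R> g) =>
    new Box<R>(g(Value, f(Value).Value));
}

// The comprehension below becomes a SelectMany call; no sequences anywhere.
var five = from x in new Box<int>(2)
           from y in new Box<int>(3)
           select x + y; // a Box<int> containing 5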

The language designers for C# 3.0 were therefore faced with a choice: how general should the LINQ feature be? There’s a spectrum of possibilities; a few points on that spectrum are:

  • We all know that a monad is just a monoid in the category of endofunctors, which implies that there are other patterns like “monoid” and “category” and “functor” that we could be describing in the type system; what’s so special about monads? We could create a whole “higher-kinded” type system to describe generic type patterns; “monad” is just one possible pattern.
  • We could embed monads as a special kind of thing right into the language, using the vocabulary of monads: bind, unit and so on. In this language, sequences are just a special case of the more general monad feature.
  • The feature could be made specifically about operations on sequences: not just Select, SelectMany and Where, which make sense on multiple monads, but also operations that do not apply across monads, like GroupBy, Join and OrderBy. The operations could be applicable to any data source that supports them, whether XML documents or database tables or plain old lists, but the concepts in the API are tailored to the domain of data in the form of a sequence of things; it has to be a sequence.
  • The feature could be narrowly tailored to the needs of a particular data technology; LINQ could have been simply allowing SQL Server query syntax to be embedded directly in the language, and it could have been made to work only with SQL Server and no other dialects of SQL or other database back ends.

The designers of C# chose a point on the spectrum slightly more general than the third point: the feature is not written in the jargon of monads, but it is more general than simple operations on sequences. The designers of Haskell chose maximum generality, and so on.

These are of course not the only options; there were any number of points on that spectrum where the language design could have fallen, and different language designers have chosen different points. Think about list comprehensions in Python, for instance; they are obviously similar to LINQ in many respects, but you’re not going to be marshaling a list comprehension to a database or using it as a generalized monadic bind operation any time soon. There’s nothing wrong with that; it was a reasonable choice for Python.

Our exploration of representing distributions as a monad and combining them using LINQ operators and comprehensions illustrates the pros and cons of C#’s choice. Though I think it is delightful that we can use Select, SelectMany and Where on probability distributions as though they were infinite sequences, and pleased that we can optimize the results using the laws of probability for discrete distributions, the language designer part of me feels like this is somehow wrong. We’re using the vocabulary of sequences, not the vocabulary of probability, and it feels out of place.

This in particular was brought into focus for me in my April 1st bonus episode. Though it was in parts deliberately obtuse and misleading, there was no real joke there; rather, I wished to illustrate a serious point with a silly example. Consider these two queries:

from car in doors
from you in doors
from monty in (
    from m in doors 
    where m != car 
    where m != you 
    select m)
select (car, you, monty)

and

from car in doors
from you in doors
from monty in doors 
where monty != car 
where monty != you 
select (car, you, monty)

Plainly we can refactor the first into the second if doors is a sequence; this program transformation preserves semantics for sequences. But if doors is a distribution then the refactoring preserves the support but changes the weights.

This is some evidence that query comprehensions are perhaps not the best way to express operations on distributions: our intuitions about what refactorings are correct are heavily influenced by our understanding of sequence semantics, and this could lead to bugs.

Moreover, think about how the design of C# has evolved with respect to monadic types:

  • C# 2 added nullable types and embedded nullable arithmetic and conversions in the language, but did not provide a general mechanism for lifting function calls and other operations to nullables until C# 6 added ?.
  • C# 2 added statement-based sequence workflows in the form of yield return.
  • C# 3 added query comprehensions for composing operations on sequences in an expression context.
  • Reactive extensions (Rx) leveraged LINQ to support observables — push-based sequences — but did not add any new syntax; it could do this because the duality between pull- and push-based sequences is strong and our intuitions largely still give correct outcomes.
  • C# 5 added statement-based asynchronous workflows; though you can make async lambdas and use the await operator in the middle of an expression, asynchronous workflows are fundamentally collections of statements, not fluent compositions of expressions like query comprehensions.

You’ll notice that there is no overarching philosophy here; rather, as the needs of real-world developers evolve, the language evolves to represent different concepts at different levels of generality.

Async workflows could have been made to look just like Rx query expressions; after all, a task is logically an observable that pushes a single value instead of multiple values. But the designers of C# figured that it would be more natural for developers to mix asynchrony into their statement-based workflows. Similarly, nullables could be treated as sequences that contain zero or one value, but they’re not.

The lesson of C# here is: when you’re trying to represent a new concept, try creating a domain-specific subset of the language that solves the problem using the vocabulary of the problem, but with enough generality to be extensible in the future.

Could we do the same for stochastic workflows? Using LINQ as our representation has been cute, fun and educational, but as language designers we can recognize that real-world users would likely find this choice confusing and weird; distributions are like sequences in many ways, but they are also dissimilar in a lot of ways.

Can we come up with a domain-specific language subset that better matches the domain of probabilistic programming, but preserves (or improves upon) the nice properties that we’ve seen so far? We’ve seen that we can express a discrete-distribution workflow as a comprehension and automatically get an inferred distribution out the other end; could we do the same thing in a statement-based workflow language, akin to async/await?


Next time on FAIC: I’m going to sketch out a super-stripped down version of C# and add two new statements for probabilistic workflows. We’ll then show how that stripped-down version could be lowered to ordinary C# 7.

Fixing Random, part 18

Before that silly diversion I mentioned that we will be needing the empty distribution; today, we’ll implement it. It’s quite straightforward, as you’d expect. [Code for this episode is here.]

public sealed class Empty<T> : IDiscreteDistribution<T>
{
  public static readonly Empty<T> Distribution = new Empty<T>();
  private Empty() { }
  public T Sample() =>
    throw new Exception("Cannot sample from empty distribution");
  public IEnumerable<T> Support() =>
    Enumerable.Empty<T>();
  public int Weight(T t) => 0;
}
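
Just to make the contract concrete, here is what illustrative usage of the empty distribution looks like; this is not part of the episode's code. (The Any call assumes a using System.Linq; directive.)

var empty = Empty<int>.Distribution;
Console.WriteLine(empty.Support().Any()); // False: the support is empty
Console.WriteLine(empty.Weight(123));     // 0: everything has weight zero
// empty.Sample() throws; that is the point.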

Easy peasy. Now that we have this, we can fix up our other distributions to use it. The WeightedInteger factory becomes:

public static IDiscreteDistribution<int> Distribution(
  IEnumerable<int> weights)
{
  List<int> w = weights.ToList();
  if (w.Any(x => x < 0))
    throw new ArgumentException();
  if (!w.Any(x => x > 0))
    return Empty<int>.Distribution;
[…]

And the Bernoulli factory becomes:

public static IDiscreteDistribution<int> Distribution(
  int zero, int one)
{
  if (zero < 0 || one < 0)
    throw new ArgumentException();
  if (zero == 0 && one == 0)
    return Empty<int>.Distribution;
[…]

And the StandardDiscreteUniform factory becomes:

public static IDiscreteDistribution<int> Distribution(
  int min, int max)
{
  if (min > max)
    return Empty<int>.Distribution;
[…]

And the Projected factory becomes:

public static IDiscreteDistribution<R> Distribution(
  IDiscreteDistribution<A> underlying, Func<A, R> projection)
{
  var result = new Projected<A, R>(underlying, projection);
  if (result.weights.Count == 0)
    return Empty<R>.Distribution;
[…]

And one more thing needs to change. Our computation in SelectMany assumed that none of the total weights are zero. Easily fixed:

int lcm = prior.Support()
  .Select(a => likelihood(a).TotalWeight())
  .Where(x => x != 0)
  .LCM();

We also have a division by total weight; don’t we have to worry about dividing by zero? Nope. Remember, the empty distribution’s support is the empty sequence, so when we then say:

var w = from a in prior.Support()
        let pb = likelihood(a)
        from b in pb.Support()
        group prior.Weight(a) * pb.Weight(b) *
          lcm / pb.TotalWeight()
        by projection(a, b);

If prior.Support() is empty then the whole query is empty and so the division is never executed. If prior.Support() is not empty but one of the pb.Support() is empty then there is no b from which to compute a group key. Either way, we never actually execute a division by a zero total weight, and so there is no division-by-zero error to avoid.

That was relatively painless, but it is probably still very unclear why we’d ever need an empty distribution. It seems to be like a little bomb hiding in the program, waiting for someone to sample it. Have we just committed another “null reference” design fault? In a few episodes we’ll see what benefits justify the costs.


Next time on FAIC: We’ve been having a lot of fun treating distributions as monads that we can use query comprehensions on, but is that really the best possible syntax?

Fixing Random, bonus episode 1

I just thought of a really cute application of the stochastic workflow technology we’ve been working on; most of the series has already been written but it fits in here, so I’m going to insert this extra bonus episode. We’ll implement the zero value next time.

Code for this bonus episode is here.


You are probably familiar with the famous “Monty Hall” problem, but if not, here it is:

[Photo: Monty Hall on ABC TV]

  • You’re on a game show hosted by Monty Hall, the handsome Canadian fellow pictured above.
  • Before you there are three closed doors; behind a uniformly randomly selected door there is a new car; behind the other two there is nothing.
  • You get to choose a door, and you get what is behind it as a prize: either a car, or nothing.
  • You randomly choose a door, again by some uniform process.
  • Monty — who knows where the car is — now always opens a door that meets two criteria: it does not have the car behind it, and it is not the door you chose.
  • To clarify: if you chose the door with the car, Monty chooses one of the remaining two doors by a uniform random choice. If you chose a door without the car, Monty only has one door he can open, and he opens that one.
  • Monty gives you the opportunity to switch your choice to the other still-closed door.
  • Assuming you wish to maximize your probability of winning the car, should you switch doors or stay with your original choice?

Aside: I’ve tried to be very precise in my description of the game for the purposes of our analysis. In the real game as played on television there were irrelevant details such as: the “prizes” behind the other two doors were goats or other bizarre, undesirable items, and so on. But there were also germane differences between the real game and our model above; for example, in the real game Monty would sometimes offer choices like “do you want to switch your door, or forget about the doors entirely and take the prize that is in this box?” and it is not clear by what process Monty decided to offer those choices. In this simplified version of the game I’ve removed all human agency from Monty; for our purposes, Monty is just a machine that is following an algorithm that involves generating random outcomes.


Exercise 1: If you don’t already know the solution, work it out. The answer is below.


 

.

.

.

.

.

You are two-to-one more likely to win the car if you switch than if you stay. But don’t take my word for it. Let’s solve the problem with computers, not by thinking!

Plainly the key to the problem is what is the distribution of Monty’s choice? Monty chooses a random door, but is observed to not pick a door with a car or the door which you picked. We can represent that as a two-parameter likelihood function:

IDiscreteDistribution<int> doors = SDU.Distribution(1, 3);
IDiscreteDistribution<int> Monty(int car, int you) =>
    from m in doors 
    where m != car 
    where m != you 
    select m;

There’s no logical difficulty in adding more parameters to a likelihood function; think of the parameters as a tuple if that makes you feel better.

Now we can answer the question. Here’s the probability distribution of winning if you do not switch:

var noSwitch1 = 
  from car in doors
  from you in doors
  from monty in Monty(car, you)
  select car == you ? "Win" : "Lose";
Console.WriteLine(noSwitch1.ShowWeights());

And the output is:

Win:1
Lose:2

As predicted by thinking, you are twice as likely to lose if you do not switch. Computers for the win!
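
For completeness, here is a sketch of the corresponding query for the switching strategy; it is not part of the original code for this episode, but it follows the same pattern. Since Monty's door is never the car and never yours, switching wins exactly when your first pick was wrong:

var switch1 = 
  from car in doors
  from you in doors
  from monty in Monty(car, you)
  select car == you ? "Lose" : "Win";
Console.WriteLine(switch1.ShowWeights());

The weights should come out two-to-one in favor of winning, matching the analysis above.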


Exercise 2: Wait a minute, we never even used the value of range variable monty in the query. How is it possible that adding a from clause to the query changes its outcome when the sampled value is not even used?!?

Give this some thought and answer in the comments.


Exercise 3: OK smart person, if you thought that one was easy, take a look at this one.

We have our likelihood function Monty() which is just a query comprehension, and our noSwitch1 which is also just a query comprehension. We can make the program a little bit shorter by combining them together in the obvious way:

var noSwitch2 = 
  from car in doors
  from you in doors
  from monty in doors
  where monty != car 
  where monty != you 
  select car == you ? "Win" : "Lose";

And if we print out the weights of that one… uh oh.

Win:1
Lose:1

I would have thought this program fragment to be logically the same as before, but this gives weights of 1:1 when we know the correct answer is 1:2.

Where did I go wrong?

Again, answer in the comments.


Next time on FAIC: Let’s get back on track from this silly diversion! We were talking about the zero value, so let’s implement it.

Fixing Random, part 17

Before we get going on today’s episode of FAIC, you might want to refresh your memory of what an additive monad is; I wrote an episode of my monad series on this subject. Briefly, an additive monad is a monad where there is a “zero value”; like the number zero, “multiplying” by zero produces a zero, and “adding” a zero is an identity.

For example, the sequence monad, IEnumerable<T>, has a zero: the empty sequence. If we Select or SelectMany from the empty sequence — our analog of “multiplying” — we get an empty sequence.  If we concatenate an empty sequence onto another sequence — the sequence analog of “adding” — we get the original sequence.

All additive monads can have a Where function defined on them; if we wanted to implement Where for sequences and didn’t care about performance, we could implement it like this:

public static IEnumerable<T> Single<T>(T t)
{
  yield return t;
}
public static IEnumerable<T> Zero<T>()
{
  yield break;
}
// Non-standard Where:
public static IEnumerable<T> Where<T>(
    this IEnumerable<T> items,
    Func<T, bool> predicate) =>
  from a in items
  from b in predicate(a) ? Single(a) : Zero<T>()
  select b;

That’s slow and produces hideous collection pressure, but it works; our actual implementation of Where is just an optimization.

What about the converse? Our probability monad IDiscreteDistribution<T> has a Where function defined. We definitely have a Singleton<T> type. But our implementation of the distribution monad does not appear to have a zero value. It seems plausible that there should be a way to express Where on distributions as we did with the sequence monad: as a SelectMany that produces either the single or zero distributions based on the predicate.

Give that some thought, and then scroll down when you think you have it worked out.

.

.

.

.

.

.

Just as the zero of the sequence monad is the empty sequence, the zero of the distribution monad is the empty distribution. That is, the distribution with an empty support that throws every time it is sampled.
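
Here is a sketch of what the parallel construction would look like for distributions, assuming we have an Empty<T> distribution (which we will build in the next episode); I am also guessing at the exact shape of the Singleton factory here, so treat this as illustrative rather than as the real implementation. As with the sequence version, it is for clarity, not efficiency.

// Non-standard Where for distributions, by analogy with the sequence version:
public static IDiscreteDistribution<T> Where<T>(
    this IDiscreteDistribution<T> d,
    Func<T, bool> predicate) =>
  from a in d
  from b in (predicate(a)
      ? (IDiscreteDistribution<T>)Singleton<T>.Distribution(a) // assumed factory
      : Empty<T>.Distribution)
  select b;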

We never implemented this value because every distribution class we’ve created already throws when you try to create an empty distribution:

  • StandardDiscreteUniform throws if the range is empty.
  • Bernoulli and WeightedInteger both throw if you give them all zero weights.
  • In our current implementation a Where clause where the predicate is false for everything in the support of the underlying distribution will eventually throw.
  • In our original implementation, a Where clause where the predicate is always false hangs when sampled, but does not throw.
  • Our implementation of Select throws if the support is empty.
  • And so on.

Exercise: We have learned the following facts:

  • The zero value of the discrete distribution monad is the empty distribution.
  • The joint distribution produced by SelectMany is the analog of multiplication of two distributions.
  • Concatenation is the “addition” of the sequence monad. (The two sequences have to be of the same element type.)

I think it is pretty clear that doing a SelectMany on an empty distribution has to produce an empty distribution. But we still have a mystery to solve: what is the addition operator on two discrete distributions? They have to be of the same element type. The addition operator has to have the property that adding zero to any distribution is an identity, but what does it mean to add together two non-zero distributions?

Answer in the comments with your thoughts.


It turns out that there are some uses for an explicit empty distribution; we’ll discover what the specific benefits of it are in a later episode.

What are the costs? I don’t mean implementation costs, but rather, what are the down sides to developers of having this feature? In short: if we go down this road, what new opportunities for bugs are we producing?

One interesting cost is that we will defer an operation that can throw; this can be very confusing! A classic source of StackOverflow questions is when someone writes an enumerator block:

static IEnumerable<int> Foo(string bar)
{
  if (bar == null)
    throw new ArgumentNullException();
  yield return bar.Length;
  …
}

and then calls it:

var foo = Foo(accidentallyNullThing); // no throw
[…]
foreach (int x in foo) // throw!

The source of the problem is that the throw is delayed. If you look at the proper, industrial-strength implementations of Where, Select and so on, you’ll notice that each one is written in a style where it validates its arguments first, and then returns a call to a helper method that actually does the iteration. That way the exception is thrown close to the point of the mistake.
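
The pattern looks roughly like this; the helper name is invented for illustration, but this is the general shape of the real LINQ-to-Objects implementations.

public static IEnumerable<R> Select<T, R>(
    this IEnumerable<T> items, Func<T, R> projection)
{
  // Validate eagerly, so the exception is thrown at the site of the mistake...
  if (items == null)
    throw new ArgumentNullException(nameof(items));
  if (projection == null)
    throw new ArgumentNullException(nameof(projection));
  // ...and defer the actual iteration to a helper with a yield block.
  return SelectIterator(items, projection);
}

private static IEnumerable<R> SelectIterator<T, R>(
    IEnumerable<T> items, Func<T, R> projection)
{
  foreach (var item in items)
    yield return projection(item);
}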

However, that doesn’t fix other common variations on the problem. For example, you might have some buggy code that produces an empty sequence sometimes, and then a thousand lines later you call First on the sequence and it blows up, but the bug is where the sequence is produced.

And of course this is really no different than nullable types that blow up when we forget that they can be null; a nullable T is logically a sequence of T where the sequence length is either zero or one, and if we forget that it can be “zero length”, we get into trouble.

The empty distribution will have the same property: it will be easy to create it by accident in a buggy program and it will not blow up until it is sampled, just as nullable reference types do not blow up until they are dereferenced.

That said, we’re going to do it because the benefits are actually pretty compelling, oddly enough.


Next time on FAIC: In the next regularly-scheduled episode we will implement the empty distribution; it’ll be quite straightforward, but it will necessitate fixing up some of our existing code. However, before then I’m going to interrupt this series with a very special episode that addresses a long-standing puzzler in probability theory which I just realized we now have all the gear we need to answer. Stay tuned!


Fixing Random, part 16

[Code is here.]


This series is getting quite long and we’re not done yet! This would be a good time to quickly review where we’re at:

  • We’re representing a particular discrete probability distribution P(A) over a small number of members of a particular type A by IDiscreteDistribution<A>.
  • We can condition a distribution — by discarding certain possibilities from it — with Where.
  • We can project a distribution from one type to another with Select.
  • A conditional probability P(B|A) — the probability of B given that some A is true — is represented as a likelihood function of type Func<A, IDiscreteDistribution<B>>.
  • We can “bind” a likelihood function onto a prior distribution with SelectMany to produce a joint distribution.

These are all good results, and I hope you agree that we have already produced a much richer and more powerful abstraction over randomness than System.Random provides. But in today’s episode everything is really going to come together to reveal that we can use these tools to solve interesting problems in probabilistic inference.

To show how, we’ll need to start by reviewing Bayes’ Theorem.

If we have a prior P(A), and a likelihood P(B|A), we know that we can “bind” them together to form the joint distribution. That is, the probability of A and B both happening is the probability of A multiplied by the probability that B happens given that A has happened:

P(A&B) = P(A) P(B|A)

Obviously that goes the other way. If we have P(B) as our prior, and P(A|B) as our likelihood, then:

P(B&A) = P(B) P(A|B)

But A&B is the same as B&A, and things equal to the same are equal to each other. Therefore:

P(A) P(B|A) = P(B) P(A|B)

Let’s suppose that P(A) is our prior and P(B|A) is our likelihood. In the equation above the term P(A|B) is called the posterior and can be computed like this:

P(A|B) = P(A) P(B|A) / P(B)

I hope that is clear, but let’s move away from the abstract mathematics and illustrate an example by using the code we’ve written so far.

We can step back a few episodes and re-examine our prior and likelihood example for Frob Syndrome. Recall that this was a made-up study of a made-up condition which we believe may be linked to height. We’ll use the weights from the original episode.

That is to say: we have P(Height), we have likelihood function P(Severity|Height), and we wish to first compute the joint probability distribution P(Height&Severity):

var heights = new List<Height>() { Tall, Medium, Short };
var prior = heights.ToWeighted(5, 2, 1);
[…]
IDiscreteDistribution<Severity> likelihood(Height h)
{
  switch(h)
  {
    case Tall: return severity.ToWeighted(10, 11, 0);
    case Medium: return severity.ToWeighted(0, 12, 5);
    default: return severity.ToWeighted(0, 0, 1);
  }
}
[…]
var joint = prior.Joint(likelihood);
Console.WriteLine(joint.ShowWeights());

This produces:

(Tall, Severe):850
(Tall, Moderate):935
(Medium, Moderate):504
(Medium, Mild):210
(Short, Mild):357

Now the question is: what is the posterior, P(Height|Severity)? Remember what this is:  it is a function that takes a severity, and returns a distribution of heights.

We can compute the marginal probabilities “by hand” by looking at the weights above:

  • If symptoms are severe, there is a 100% chance that the person is tall.
  • If symptoms are moderate, 935 study members are tall for every 504 medium-height members.
  • If symptoms are mild, then that’s 210 medium people for every 357 short.

We could implement that easily enough; it’s just another function like we’ve seen many times before in this series:

IDiscreteDistribution<Height> posterior(Severity s)
{
  switch(s) … blah blah blah …

But I don’t want to have a human analyze the data and write the code. We have enough information in the IDiscreteDistribution<(Height, Severity)> to generate a Func<Severity, IDiscreteDistribution<Height>>.

In fact, we can simply add another clause to our query:

IDiscreteDistribution<Height> posterior(Severity s) => 
  from pair in joint
  where pair.s == s
  select pair.h;

We can compute the posterior with a Where clause!

Recall that what we are computing here is logically P(A&B)/P(B); just as SelectMany can be thought of as a sort of multiplication, apparently Where is logically a sort of division.

But let’s not stop here; we can make a general rule in the form of an extension method, and I’m going to slap a projection onto the back side of it just for added generality because why not:

public static Func<B, IDiscreteDistribution<C>> Posterior<A, B, C>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood,
    Func<A, B, C> projection) =>
  b => from a in prior
       from bb in likelihood(a)
       where object.Equals(b, bb)
       select projection(a, b);
public static Func<B, IDiscreteDistribution<A>> Posterior<A, B>(
    this IDiscreteDistribution<A> prior,
    Func<A, IDiscreteDistribution<B>> likelihood) =>
  Posterior(prior, likelihood, (a, b) => a);

Let’s take it for a spin.

Question: Given the prior distribution and the likelihood function, what is the posterior distribution of height amongst the study members with moderate symptoms?

var posterior = prior.Posterior(likelihood);
Console.WriteLine(posterior(Moderate).ShowWeights());

And sure enough, we get a probability distribution out that matches what we could have computed by hand:

Tall:935
Medium:504

OK, that’s pretty neat, but why is this relevant?

Because Bayesian inference is incredibly important, and incredibly easy to get wrong! Anything we can do to improve developers’ ability to use Bayesian analysis correctly is a win.

Let’s look at another example. Almost a decade ago I did a blog post where I discussed how Bayesian inference is counterintuitive. Let’s run the numbers from that blog post through our system and see what we get.

We have a disease

enum TappetsDisease { Sick, Healthy }

and our prior is that 99% of the population is healthy:

var prior = new List<TappetsDisease> { Sick, Healthy }
  .ToWeighted(1, 99);

We also have a test:

enum JethroTest { Positive, Negative }

And the test is 99% accurate. That is, if you are sick, it has a 99% chance of “positive”, and if you are healthy, it has a 99% chance of “negative”:

var tests = new List<JethroTest> { Positive, Negative };
IDiscreteDistribution<JethroTest> likelihood(TappetsDisease d) =>
  d == Sick ? tests.ToWeighted(99, 1) : tests.ToWeighted(1, 99);


Aside: You might wonder how we know that the test is 99% accurate, and how we know that 1% of the population has the disease, particularly given the counterintuitive result I’m about to discuss. That’s a great question and I’m not going to get into the details in this series of how in the real world medical practitioners evaluate the accuracy of a test or the prevalence of a condition. Let’s just suppose that we know these facts ahead of time; after all, that’s why the prior is called the prior.


Question: you have just tested positive; what is the probability that you have the disease?

Most people, and even many doctors, will say “the test is 99% accurate, you tested positive, therefore there is a 99% chance that you have the disease”. But that is not at all true; we can compute the true result very easily now:

var posterior = prior.Posterior(likelihood);
Console.WriteLine(posterior(Positive).ShowWeights());

And we get:

Sick:1
Healthy:1

It’s fifty-fifty.

Why?

If a result is confusing, always look at the joint distribution:

Console.WriteLine(prior.Joint(likelihood).ShowWeights());

(Sick, Positive):99
(Sick, Negative):1
(Healthy, Positive):99
(Healthy, Negative):9801

You tested positive. 99 out of every 10000 people are true positives, and 99 out of every 10000 people are false positives. We condition away the negatives, because you didn’t test negative, and what is left? 50% chance that you are positive, not 99%.


Aside: In this example if you test negative then you are not 99% likely to be negative; you are 99.99% likely to be negative! This is also counterintuitive to people.

Exercise: How good does the test have to be for you to have a 90% posterior probability of actually being positive given a positive result?


Bayesian inference is incredibly powerful and useful. We very frequently have good information on priors and likelihoods. We make observations of the world, and we need to figure out posterior probabilities given those observations. I could list examples all day; a classic example in information technology is:

  • We can ask users to manually classify emails into spam and non-spam. That gives us a prior on P(Spam)
  • From that collection of spam and non-spam emails, we can find out which words are commonly found only in spam. That gives us a likelihood function, P(Words|Spam).
  • We then make an observation of a real email, and the question is: given the words in an email, what is the posterior probability that it is spam? That is, what is the function P(Spam|Words). If the probability passes some threshold, we can put the mail in the spam folder.
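
As a sketch of how that last step might look with the gear from this episode: the enum, the weights and the single-phrase likelihood below are entirely made up, and a real classifier would consider many words rather than one phrase, but the shape of the computation is the same.

enum Mail { Spam, NotSpam }
// Invented prior: say 30% of the manually classified mail was spam.
var mailPrior = new List<Mail> { Mail.Spam, Mail.NotSpam }.ToWeighted(3, 7);
// Invented likelihood: given the spam status, how likely is the message
// to contain the phrase "free money"?
IDiscreteDistribution<bool> hasPhrase(Mail m) =>
  m == Mail.Spam
    ? new List<bool> { true, false }.ToWeighted(40, 60)
    : new List<bool> { true, false }.ToWeighted(1, 99);
// Given that we observed the phrase, what is the posterior on spam?
var spamPosterior = mailPrior.Posterior(hasPhrase);
Console.WriteLine(spamPosterior(true).ShowWeights());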

There are also real applications in sensor technology:

  • We have a machine in a factory which requires a part on a conveyor to stop moving before it is welded; we manually observe how often the part is stopped correctly, giving us a prior on P(Stopped)
  • We install a sensor that attempts to sense whether the part is stopped, and test its accuracy to obtain P(SensorReading|Stopped)
  • Now we have enough information to compute the posterior: given a certain reading from the sensor, what is the probability that the part has actually stopped moving? That is P(Stopped|SensorReading)
  • If we do not have a high enough probability that the part is actually stopped, we can delay the welding step until we have better evidence that the part has stopped.

There are even applications in developer tools!

  • We can gather information from open source repositories about how often certain functions are called, giving us a prior on P(Function called)
  • We can gather information from IDE keystrokes about how often a letter typed is ultimately the first letter of that function, giving us a likelihood function P(Keystrokes|Function called)
  • Now we have enough information to compute the posterior: given a certain set of recent keystrokes, what is the probability distribution on likely functions the user wishes to call? This could give us much better IntelliSense.

And so on. The opportunities for taking advantage of Bayesian inference are enormous. We really ought to have Bayesian inference on distributions in the basic toolbox of the language, the same way we have ints, doubles, strings, nullables, functions,  tasks, sequences, and so on, in that toolbox.

That’s what I mean by “Fixing Random”. The fundamental problem is not that Random has historically had a candy-machine interface; that’s just a silly historical accident that can be fixed. Rather: we’ve decided that monads like nullable, sequence, function and task are so important that they are included in the core runtime. Why? Not because they’re cool, but because having Nullable<T>, IEnumerable<T>,  Task<T>, and so on in the core runtime makes it much easier for developers to write correct, concise code that solves their problems.

Programming is increasingly about dealing with a world of unknowns; having operators in the language for concisely describing probabilistic workflows seems very valuable to me. This series seeks to make the case for that value.


Next time on FAIC: We’ll take a closer look at the discrete probability distribution type as a monad. We might be missing a concept.