# Fixing Random, part 26

We’ve been mostly looking at small, discrete distributions in this series, but we started this series by looking at continuous distributions. Now that we have some understanding of how to solve probability problems on simple discrete distributions and Markov processes, let’s go back to continuous distributions and see if we can apply some of these learnings to them. (Code for this episode can be found here.)

Let’s start with the basics. What exactly do we mean by a probability distribution function? So far in this series we’ve mostly looked at the discrete analog, a non-normalized probability mass function. That is, for an `IDiscreteDistribution<T>` we have a weight function that gives us a value for each `T`. The probability of sampling a particular `T` is the weight divided by the total weight of all `T`s.

Of course, had we decided to go with double weights instead of integers, we could have made a normalized probability mass function: that is, the “weight” of each particular `T` is automatically divided by the total weight, and we get the probability out. In this scenario, the total weight adds up to 1.0.
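For instance, a tiny illustration (the weights here are invented for the example) of turning integer weights into a normalized probability mass function:

```csharp
// Hypothetical integer weights for three outcomes.
int[] weights = { 1, 2, 3 };
double total = weights.Sum();  // 6.0
var probabilities = weights.Select(w => w / total).ToList();
// probabilities is { 0.1666…, 0.3333…, 0.5 }, which sums to 1.0.
```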

We know from our exploration of the weighted integer distribution that we can think of our probability distributions as making a rectangle where various sub-rectangles are associated with particular values; we then “throw a dart” at the rectangle to sample from the distribution; where it lands gives us the sample.

Continuous distributions can be thought of in much the same way. Suppose we have a function from double to double, always non-negative, such that the total area under the curve is 1.0. Here are some examples:

double PDF1(double x) => x < 0.0 || x >= 1.0 ? 0.0 : 1.0;
double PDF2(double x) => x < 0.0 || x >= 1.0 ? 0.0 : 2 * x;
double PDF3(double x) => Exp(-(x * x)) / Sqrt(PI);

What is the meaning of these as probability distributions? Plainly the “higher” the function, the “more likely” any particular value is, but what does it even mean to say that in our PDF2 distribution 0.5 is “less likely” than 0.6, in our PDF3 distribution it is “more likely”, and in our PDF1 distribution they are “equally likely”?

One way to think of it is that again, we “throw a dart” at the area under the curve. Given any subset of that area, the probability of the dart landing inside it is proportional to the area of the subset, and the value sampled is the x coordinate of the dart.

We can make this a little bit more formal by restricting our areas to little rectangles:

• Take a value, say 0.5.
• Now take a tiny offset, call it `ε`. Doesn’t matter what it is, so long as it is “pretty small”.
• The probability of getting a sample value between `0.5` and `0.5+ε` is `PDF(0.5)*ε`. That is, the area of a thin rectangle.

That’s it! Let’s say we’ve got values 0.5 and 0.6, and an `ε` of 0.01. How do our three distributions compare with each other near these values?

• With `PDF1`, the probability of getting a sample between 0.5 and 0.51 is approximately 0.010, and the probability of getting a sample between 0.6 and 0.61 is also approximately 0.010.
• With `PDF2`, the probability of getting a sample between 0.5 and 0.51 is approximately 0.010, and the probability of getting a sample between 0.6 and 0.61 is approximately 0.012.
• With `PDF3`, the probability of getting a sample between 0.5 and 0.51 is approximately 0.0044, and the probability of getting a sample between 0.6 and 0.61 is approximately 0.0039.

And moreover, the approximation becomes better as `ε` gets smaller.

Aside: What if we want to know the probability of getting a value between two doubles that are not “really close together”? In that case we can do “quadrature”: divide it up into a bunch of regions that are all “really small” and sum up the areas of all those little rectangles to get an approximation of the area. Or, we can use integral calculus to find the exact answer. But for the purposes of this series, we’re going to try to find ways to sample from a distribution that has a particular PDF without doing any quadrature approximations or any integral calculus.
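To make the aside concrete, here is a small quadrature sketch (illustrative only; the `Area` helper is not part of the series’ code) that approximates the probability of a PDF2 sample landing between 0.5 and 0.51, and compares it to the thin-rectangle estimate:

```csharp
// Approximate the area under a PDF by summing the areas of n thin rectangles.
double Area(Func<double, double> pdf, double lo, double hi, int n = 1000)
{
  double width = (hi - lo) / n;
  double area = 0.0;
  for (int i = 0; i < n; i += 1)
    area += pdf(lo + i * width) * width;
  return area;
}

double PDF2(double x) => x < 0.0 || x >= 1.0 ? 0.0 : 2 * x;

Console.WriteLine(Area(PDF2, 0.5, 0.51)); // quadrature: about 0.0101
Console.WriteLine(PDF2(0.5) * 0.01);      // thin-rectangle estimate: 0.010
```

As expected, the two estimates agree closely, and they agree better as the interval shrinks.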

All right, we understand what PDFs are. To summarize:

• A PDF is a pure function from double to double.
• A PDF is defined over the entire range of doubles.
• `PDF(x) >= 0.0` for all `x`.
• Integrating the PDF over the entire real line gives us a total area of 1.0.
• The probability that a sample is between `x` and `x+ε` is approximately `PDF(x)*ε`, if `ε` is small.

For our purposes though I’m going to relax those conditions a bit. In the code I want to explore, we’ll assume that a “non-normalized” PDF has these properties:

• A PDF is a pure function from `T` to `double`…
• … defined over the entire range of `T`.
• `PDF(t) >= 0.0` for all `t`.
• Integrating the PDF over all possible values of `T` gives us a total area of `A`, for some finite positive value `A`. That is, the distribution is not necessarily “normalized” to have an area of 1.0.  (You’d think that we can “easily” get a normalized distribution from a non-normalized distribution by dividing every weight by `A` but it is not always easy to know what that area is. But perhaps sometimes we do not need to know! Foreshadowing!)
• The probability that a sample is within `ε` of `t` is approximately `PDF(t)*ε/A` if `ε` is small. (Of course, as I noted above, we might not know the normalization constant.)

Aside: That last point is a little vague; what does it mean to be “within `ε` of some `t`” if `t` is not `double`? We’re going to just not worry about that. It’ll be fine. Particularly since we’re mostly going to look at situations where `T` is `double` anyways.

This becomes useful for cases like the multidimensional case where `T` is `(double, double)`, say, and by “within `ε`” we mean “inside a square of side `√ε`” or some such thing. Putting this all on a solid mathematical footing would take us too far off topic; you get the idea and let’s move on. I don’t know if we’ll get to multidimensional distributions in this series or not.

As we’ll see, continuous distributions are a much harder beast to tame than the small discrete distributions we’ve looked at before. However, there are still things we can do with them. Let’s start by making a definition in the type system for a distribution that has one of our relaxed PDFs. It’s awfully familiar:

public interface IWeightedDistribution<T> : IDistribution<T>
{
  double Weight(T t);
}

This means that every discrete distribution gets to be a weighted distribution pretty much “for free”. Let’s implement that:

public interface IDiscreteDistribution<T> : IWeightedDistribution<T>
{
  IEnumerable<T> Support();
  new int Weight(T t);
}

And now we have some minor taxes to pay; we’ll need to add

double IWeightedDistribution<T>.Weight(T t) => this.Weight(t);

to all our existing discrete distributions, but that is easily done.

We can also go back and fill in weight functions in our `Normal` class:

private static readonly double piroot = 1.0 / Sqrt(2 * PI);
public double Weight(double x) =>
  Exp(-(x - μ) * (x - μ) / (2 * σ * σ)) * piroot / σ;

And our `StandardContinuousUniform` class:

public double Weight(double x) =>
  0.0 <= x && x < 1.0 ? 1.0 : 0.0;

Easy peasy.

We’ve got a new variation on our old concept of weighting a distribution; let’s see where we can go with this.

Next time on FAIC: We know how to efficiently sample from any small, discrete distribution with integer weights, but distributions based on arbitrary PDFs are much harder beasts to tame. Given an arbitrary non-normalized PDF, can we sample from it?

# Fixing Random, part 25

Last time on FAIC we implemented the Markov process distribution, which is a distribution over state sequences, where the initial state and each subsequent state is random. There are lots of applications of Markov processes; a silly one that I’ve wanted to do for literally decades is randomized text generation. (The silly code for this silly episode is here; bring your own corpus!)

In my youth I collected back issues of Scientific American, which I bought at the sadly now defunct Casablanca Books in my home town. I believe I sold all of them back when I moved out to Seattle; they were pretty bulky, though I might still have my autographed-by-Benoit-Mandelbrot copy of the issue about fractals somewhere in the house.

My favourite columns in old SciAms were of course the mathematical / programming games columns by Martin Gardner, Kee Dewdney and Douglas Hofstadter. The one germane to today’s topic was a Dewdney column about “Mark V. Shaney” — what we would now call a “bot”, that was running a Markov process generated by analysis of USENET posts, sampling from the resulting distribution to produce a “Markov chain” sequence (Mark V. Shaney, get it?) and posting the resulting generated text right back to USENET.

I’ve always wanted to replicate that, and today I have my chance. The basic technique is straightforward:

• Start with a corpus of texts.
• Break up the text into words.
• Group the words into sentences.
• Take the first word of each sentence and make a weighted distribution; the more times this word appears as a first word in a sentence, the higher it is weighted. This gives us our initial distribution.
• Take every word in every sentence. For each word, generate a distribution of the words which follow it in the corpus of sentences. For example, if we have “frog” in the corpus, then we make a distribution based on the words which follow: “prince” twice, “pond” ten times, and so on. This gives us the transition function.
• There are a number of ways to handle words which end sentences, but an easy way to do it is simply to have a bunch of “end words” where the transition function gives the empty distribution.
• Sample repeatedly, and format the sequences of words back into sentences.
• Hilarity!

OMG let’s do it.

All our probability distributions thus far have been immutable and we should continue that tradition. I’ll use the builder pattern to construct immutable objects. We’ll start by making a categorical distribution builder, and then make a Markov builder from that:

sealed class DistributionBuilder<T>
{
  private Dictionary<T, int> weights = new Dictionary<T, int>();
  public void Add(T t)
  {
    weights[t] = weights.GetValueOrDefault(t) + 1;
  }
  public IDistribution<T> ToDistribution()
  {
    var keys = weights.Keys.ToList();
    var values = keys.Select(k => weights[k]);
    return keys.ToWeighted(values);
  }
}

Aside: Recall that we do not need to worry that we’ve closed over a member that is a mutable collection here, because `ToWeighted` is going to call `ToList` on the values.
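A quick usage sketch (the words and counts here are invented):

```csharp
var builder = new DistributionBuilder<string>();
builder.Add("pond");
builder.Add("pond");
builder.Add("prince");
// "pond" has weight 2 and "prince" has weight 1, so a sample from
// this distribution is "pond" two times out of three on average.
IDistribution<string> d = builder.ToDistribution();
```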

Our Markov process builder is only slightly more complicated:

sealed class MarkovBuilder<T>
{
  private DistributionBuilder<T> initial =
    new DistributionBuilder<T>();
  private Dictionary<T, DistributionBuilder<T>> transitions =
    new Dictionary<T, DistributionBuilder<T>>();
  public void AddInitial(T t)
  {
    initial.Add(t);
  }
  public void AddTransition(T t1, T t2)
  {
    if (!transitions.ContainsKey(t1))
      transitions[t1] = new DistributionBuilder<T>();
    transitions[t1].Add(t2);
  }
  public Markov<T> ToDistribution()
  {
    var init = initial.ToDistribution();
    var trans = transitions.ToDictionary(
      kv => kv.Key,
      kv => kv.Value.ToDistribution());
    return Markov<T>.Distribution(
      init,
      k => trans.GetValueOrDefault(k, Empty<T>.Distribution));
  }
}

Super! Now all we need is the complete works of Shakespeare, thank you Project Gutenberg. We don’t need to load the whole five megs into memory at once; we can process it line-at-a-time thanks to our builder. We’ll start by splitting up the line stream into a word stream:

static readonly char[] punct =
  "<>,*-()[#]@:%\"/';_&}".ToCharArray();
static IEnumerable<string> Words(
  this IEnumerable<string> lines) =>
  from line in lines
  from word in line.Split()
  let raw = word.Trim(punct)
  where raw != ""
  select raw;

Exercise: This is a pretty crappy word-splitting algorithm. Basically I’m stripping out any leading or trailing punctuation but I’m leaving in the periods, exclamation marks and question marks. This means that I’m stripping out the trailing single quotes in Shakespearean contractions like th’.

I’m also not canonicalizing the casing; I’ll be treating “The” and “the” as different words, and so on. This is not necessarily the best approach.

Can you come up with a better word-splitting algorithm than this one? This will do for my silly purposes, but it would be nice to have a better algorithm.
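For what it is worth, here is one direction a better splitter might take — a sketch only, with plenty of flaws of its own: keep only letters, apostrophes and hyphens, plus any sentence-ending punctuation, and discard everything else.

```csharp
// Sketch: keep word characters and sentence-ending punctuation. This still
// does not canonicalize case, and still makes "Mr." look like a sentence end.
static IEnumerable<string> BetterWords(
  this IEnumerable<string> lines) =>
  from line in lines
  from word in line.Split()
  let cleaned = new string(word
    .Where(c => char.IsLetter(c) || c == '\'' || c == '-'
             || c == '.' || c == '!' || c == '?')
    .ToArray())
  where cleaned != ""
  select cleaned;
```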

A generalized “split a sequence into subsequences” operator is a handy tool to have that is not included in the LINQ sequence operators; let’s make one:

public static IEnumerable<List<T>> Split<T>(
  this IEnumerable<T> items,
  Func<T, bool> last)
{
  var list = new List<T>();
  foreach (T item in items)
  {
    list.Add(item);
    if (last(item))
    {
      yield return list;
      list = new List<T>();
    }
  }
  if (list.Any())
    yield return list;
}
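Given the intended behavior — each element for which `last` is true ends its sublist, and a trailing partial sublist is also yielded — a quick example:

```csharp
var items = new[] { 1, 2, 0, 3, 4, 0, 5 };
foreach (var sublist in items.Split(x => x == 0))
  Console.WriteLine(string.Join(",", sublist));
// 1,2,0
// 3,4,0
// 5
```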

And now making a sentence-extractor is straightforward:

static IEnumerable<List<string>> Sentences(
  this IEnumerable<string> words) =>
  words.Split(w => {
    char c = w[w.Length - 1];
    return c == '.' || c == '!' || c == '?';
  });

Again, not great; this would make “Mr.” the end of a sentence for instance. Notice again that I’m leaving the punctuation in the word. Perhaps not the best choice but it will work for our purposes.

Put it all together:

var builder = new MarkovBuilder<string>();
var sentences = File.ReadLines("shakespeare.txt")
  .Words()
  .Sentences();
foreach (var sentence in sentences)
{
  builder.AddInitial(sentence[0]);
  for (int i = 0; i < sentence.Count - 1; i += 1)
    builder.AddTransition(sentence[i], sentence[i + 1]);
}
var markov = builder.ToDistribution();

Excellent. And now, release the Markovian monkey and we’ll see if Hamlet emerges:

Console.WriteLine(markov
  .Samples()
  .Take(100)
  .Select(x => x.SpaceSeparated())
  .NewlineSeparated());

Here’s a slightly-cleaned-up-for-legibility sample output; I’ve moved around the line breaks to make it easier to read.

SCENE I.

POET.

A very echo with him there. I say away!

HECTOR.

You know the body. Sir by my love him an end.
But your king? Sir John of a purse I grant
that thou brought forth in them with ears.
What’s the rebels with him. Well and others
orchards grew together.

BERTRAM.

Tis time his horse?

Dost thou seek how well but a shame this is
but with the Duke of Eve’s legacy be
the counter-gate which bears will stand
till he hath great world shall be a woman
with my ever is to cunning bade him
for recordation to death.

Exit GLOUCESTER

Re-enter Ghost.

CYMBELINE.

Grim-visag’d war but since thou turn the morn-dew
on this wretched soul elected him fair Helen
then let all our actors are you?

Now are both mine and said well his pitcher
by the heat of mine own desert?

What’s o’clock? I may not you do fight with you.
The mother breeds suspicion hath! Be soon shot.

Exeunt ACT III.

And I think I will leave it there.

It’s not Hamlet, to be sure, but it’s a start.

I also tried the complete works of Jane Austen:

The principal part of the result of your own line from any reply to the ecstasies of meeting he gravely but it in Hertfordshire and I mean to declare I have the days brought you must not. You must be happier than in Highbury with awful object as a greater steadiness had reached the picture of consequence to any objection to each other day came out.

Just… wow.

Next time on FAIC: We’ll turn our attention back to continuous distributions.

# Fixing Random, part 24

[Code for this episode is here.]

So far in this series we’ve very briefly looked at continuous distributions on doubles, and spent a lot of time looking at discrete distributions with small supports. Let’s take a look at a completely different kind of distribution that doesn’t easily fit into our “small discrete” bucket but isn’t continuous either. Some motivating scenarios:

• Imagine you find yourself placed at random in a city and you want to take a stroll. At every intersection you make a random choice of which direction to go next. The random choice of starting points has a particular distribution, and the distribution of possible direction choices could be different at each intersection. The result of each stroll is a path through the city. What is the distribution of possible strolls?
• Imagine you find yourself at a random Wikipedia/tvtropes/whatever page, and you want to go down that rabbit hole. On each page you randomly select a link to another page within the site (if there is one.) What is the distribution of possible sequences of pages?
• Imagine you are a gambler who starts with a certain random amount of cash in your pocket and a target for how much you want to win and how much you are willing to lose. If you’ve won or lost too much, you go home; otherwise, you make a random-sized bet on a slot machine with certain odds, where the distribution of bet sizes and machine choices is a function of how much money you have in your pocket right now. What is the distribution of possible sequences of cash in your pocket?

Each of these cases has a number of things in common:

• We start in a randomly chosen state.
• We randomly move to a new state depending solely on our current state, and not on any “memory” of how we got to that state.
• The new state might be a “stop state”, in which case the sequence of states is over.
• The question is: what is the distribution of possible state sequences?

A random process that has these properties is called a Markov process, after Russian mathematician Andrey Markov. Such processes are extremely useful and have been heavily studied for the last hundred years or so.

Aside: For this series I’m only going to look at Markov processes where each “state change” happens at no particular time; all we care about is the sequence. There is a whole subfield of studying Markov processes where we care about how long the gap is between events, and what the distribution of those gaps is; I’m not going to go there.

Aside: Note that there is no requirement in a Markov process that any of the distributions involved be discrete. The examples I gave all involved discrete quantities: intersections, web pages, dollar amounts. But you could certainly do a Markov process where you randomly chose a point on a line via some continuous distribution, and then chose the next point based on a continuous distribution that is a function of the current position. We will look at examples of such in later episodes; for now though we’ll stick to discrete examples.

Aside: Note that there is no requirement whatsoever that the sequence sampled from a Markov process be finite! It can be finite or infinite, depending on the details of the process.

Aside: Suppose we produce a sequence of dice rolls. We roll a die, and no matter what we rolled, we roll it again, every time. Technically that is a Markov process; the next roll depends on nothing, not even the previous roll. We will not be considering such “degenerate” Markov processes in this series; we’re interested in processes where the distribution of next states depends on the current state.

Let’s temporarily leave aside our discrete weighted distributions and go back to our underlying unweighted `IDistribution<T>` interface for this one. (Working out the “weight” associated with a particular sequence is an interesting problem, but not one that we’re going to cover in this series.) Here we have a distribution on sequences of `T`:

sealed class Markov<T> : IDistribution<IEnumerable<T>>
{
  private readonly IDistribution<T> initial;
  private readonly Func<T, IDistribution<T>> transition;
  public static Markov<T> Distribution(
      IDistribution<T> initial,
      Func<T, IDistribution<T>> transition) =>
    new Markov<T>(initial, transition);
  private Markov(
    IDistribution<T> initial,
    Func<T, IDistribution<T>> transition)
  {
    this.initial = initial;
    this.transition = transition;
  }
  public IEnumerable<T> Sample()
  {
    var current = this.initial;
    while (true)
    {
      if (current is Empty<T>)
        break;
      var s = current.Sample();
      yield return s;
      current = this.transition(s);
    }
  }
}

Aside: You probably noticed that the state transition function is just another likelihood function: given the current state, it tells you what next states are most likely.

Let’s look at a simplified version of our “gambler’s ruin” example.

• Our state is the current amount of money we have: an integer.
• We need a distribution on initial conditions. Let’s say we’ll start off with exactly `n` dollars every time, so that’s a singleton distribution.
• If the state ever gets to zero or to twice `n`, we quit. We represent quitting by producing an empty distribution. We cannot produce any more states because we cannot sample from the empty distribution!
• If we are not in a quitting state then we flip a coin. This causes our state to transition to either the current bankroll minus one, or plus one.

const int n = 5;
var initial = Singleton<int>.Distribution(n);
IDistribution<int> transition(int x) =>
  (x == 0 || x == 2 * n) ?
    Empty<int>.Distribution :
    Flip().Select(b => b ? x - 1 : x + 1);

All right, let’s take it for a spin:

var markov = Markov<int>.Distribution(initial, transition);
Console.WriteLine(markov.Sample().CommaSeparated());

The first time I ran this I got:

```
5,4,3,2,3,4,5,4,3,4,3,4,5,4,3,4,5,6,5,
6,5,6,5,6,5,6,5,4,5,6,7,8,9,8,9,8,9,10
```

It took 38 flips to get from 5 to 10. (And we ended up back at the start nine times!)

Exercise: Should the length of this sequence surprise you? Is it longer than average, shorter than average, or does it seem like it should be a pretty typical sample from this distribution? The solution is below.

.

.

.

.

.

Let’s think about the possible lengths of sequences. A few things come to mind:

• The shortest possible games are 5, 4, 3, 2, 1, 0 and 5, 6, 7, 8, 9, 10, so we won’t expect to see anything below six.
• Every game must have an even number of items in the sequence; do you see why?
• The maximum length is… well, there is no maximum length. We could go 5, 6, 5, 6, 5, 6, … arbitrarily long before heading off to 10 or 0.

The easiest way to get a handle on typical sequence lengths is to just draw the histogram:

markov.Samples().Select(x => x.Count()).DiscreteHistogram()

```
 6|********************************
 8|**************************************
10|****************************************
12|************************************
14|********************************
16|*****************************
18|****************************
20|**************************
22|***********************
24|*******************
26|*******************
28|******************
30|****************
32|***************
34|***********
36|**********
38|**********
40|********
42|********
44|********
46|******
48|******
50|******
52|****
54|****
56|***
58|****
60|***
62|**
64|**
66|**
68|***
70|**
72|*
```

I cut it off there; the “long tail” in this particular run actually went out to 208!

We can see from the histogram that the mode is ten flips and most games end in fewer than 38 flips, but there is still a substantial percentage that take longer. We could work out exactly how much, but I think I will leave it there for now. You get the idea.

We seem to be doing pretty well here; we’ve created an entire new class of distributions that we can explore using fluent programming in only a few lines of code.

Next time on FAIC: It takes only one Markovian monkey to produce an approximation of Hamlet.

# Fixing Random, part 23

So… I’ve got good news and bad news.

The good news is:

• I’ve described an interface for discrete probability distributions and implemented several distributions.
• I’ve shown how projecting a distribution is logically equivalent to the LINQ `Select` operator.
• I’ve shown how conditional probabilities of the form P(B|A) “probability of B given A” are likelihood functions.
• I’ve shown that combining a prior with a likelihood to form a joint distribution is the application of the monadic bind operator on the probability monad, and implemented this as `SelectMany`.
• I’ve shown that computing the posterior distribution from the joint distribution can be done by applying a `Where` clause.
• I’ve sketched out a statement-based workflow DSL that can be compiled down to our LINQ operators, and that lowering and then executing programs written in this DSL can compute a “sampling-loop-free” distribution automatically.
• And that distribution object can then be efficiently sampled.

This is all great, if I do say so myself. But the bad news is that there are some problems; let’s list them in order roughly from smallest to largest:

First problem: It’s often inconvenient to express weights as integers, particularly when you end up with largish coprime weights; we’re risking integer overflow when we multiply or add them, and we lack facilities for common operations like “make me a Bernoulli distribution based on this arbitrary double”.

That criticism is very valid, but the problem is totally fixable. I chose integer weights because it is easy to get integer arithmetic right in simple cases and we do not require any additional library support. A realistic implementation of a weighted discrete distribution would probably choose arbitrary-precision integers, arbitrary-precision rationals or doubles depending on its needs.

Using doubles slightly complicates the math because you either have to make sure that probabilities add up to 1.0 even when double representation error bites you, or you have to deal with situations where the total weight is not 1.0. The latter problem is not so bad; we can always divide through by the total weight if we need to. We’ll return to the more general problem of dealing with double-weighted distributions that do not add up to 1.0 later in this series.

Aside: Since probabilities are always between 0 and 1, and often very small, a common choice is to take the log of the probability as the weight. In that scheme we can implement multiplying two probabilities together as adding, which is nice, but lots of other operations get harder. This is also a nice representation if you’re doing math that involves computations of entropy, as entropy is defined as the negative log of the mass function. We won’t go there in this series.
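To see why, consider this small illustration (not code from the series; it assumes the usual `using static System.Math;`): the product of a thousand small probabilities underflows a double to zero, but the sum of their logs is perfectly representable:

```csharp
// The product of a thousand probabilities of 0.01 underflows to zero...
double product = 1.0;
for (int i = 0; i < 1000; i += 1)
  product *= 0.01;
Console.WriteLine(product);     // 0

// ...but in log space, multiplying probabilities is just adding logs.
double logProduct = 0.0;
for (int i = 0; i < 1000; i += 1)
  logProduct += Log(0.01);
Console.WriteLine(logProduct);  // about -4605.17
```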

Second problem: The basic idea of both our LINQ-based and statement-based implementations is to take a workflow that might contain conditions — `where` clauses, in the query form, `condition` statements in the statement form — and infer a distribution object that can be sampled from efficiently.

But our technique for that is for `SelectMany` to produce every possible outcome, and then group them to determine the weights! For example, suppose we had a workflow like

var d10 = SDU.Distribution(1, 10);
var sum = from a in d10
          from b in d10
          from c in d10
          from d in d10
          let s = a + b + c + d
          where s >= 20
          select s;

If these were sequences instead of distributions, we’d know what happens here: we enumerate all ten thousand possible combinations, run a filter over their sum, and then yield a whole sequence of sums that match the filter.

But that’s almost exactly what we’re doing with the distributions too! Our query here is basically a syntactic sugar for

var d10 = SDU.Distribution(1, 10);
var sum = from a in d10.Support()
          from b in d10.Support()
          from c in d10.Support()
          from d in d10.Support()
          let s = a + b + c + d
          where s >= 20
          group s …

and you know how this goes from our implementation of `SelectMany` on distributions: we weight each member, group them by outcome, sum up the weights, and so on. The math at the end isn’t the expensive part; the number of nested loops we go through to get to that math is!

Now imagine if this were a much more complex workflow, whether written in our query form or our statement form. Inferring the result of the workflow basically means evaluating every possible value for every local/range variable and recording the result. If there are lots of possible values for lots of variables, we could end up exploring millions or billions of possibilities, some of which are very low probability.

Now, the benefit here is of course that we only do that exploration once, and the probability distribution that comes out the other end is then cheap to use forever; we can memoize it, because by assumption everything is pure. But still, it seems odd that we’re implementing inference by running every possible path through the workflow and recording the results; it seems like we could trade off a small amount of inaccuracy for a great improvement in speed.

Put another way: in our original naive implementations of `Where` and `SelectMany` we didn’t do any inference; we just sampled the underlying distributions and lived with the fact that `Where` could loop a lot. Compare constructing the naive implementation to constructing our current implementation:

• Constructing the naive version of the distribution is O(n) in the number of query operators. In this case, we’ve got a tiny handful of `SelectMany`s and a `Where`. The number of query operators is usually very small.
• Constructing the exact distribution is O(n) in the number of possible combinations of all variables, in this case, ten thousand. Those ten thousand computations then produce a correct weighted integer distribution.

And compare sampling:

• Sampling the naive version would ultimately do four samples from d10, add them up, and reject the resulting sum a little bit less than half of the time. On average we would do around seven or eight samples of underlying distributions per one sample of the sum distribution.
• Sampling the exact version is just sampling from a weighted integer, which we know does exactly two samples from its implementation distributions.

And compare knowledge of the weights:

• The naive version can compute weights cheaply for `Where` conditioned distributions, but not distributions that have `SelectMany` clauses; it’s computing the weights that is expensive.
• The exact version computes all weights exactly for all workflows.

Which way is the win? Consider these scenarios:

• We’re never going to call `Sample()`, but we want an accurate inference of the true weights of outcomes given a workflow. When we’re computing the posterior distribution of “is this email a spam?” we don’t care to ever sample from that distribution; we want to know is it 99% or 1%?
• We’re going to sample millions or billions of times, so keeping the per-sample cost as low as possible is important.
• We are sampling and a `Where` clause rejects the vast majority of possible outcomes, causing long waits for a sample in the naive case.

In those three scenarios, our current “exact but expensive inference” step is clearly better. But if those conditions are not met, then the current version is doing a lot of up-front work in order to save possibly very little work later.

Summing up:

• It is not immediately obvious where to draw the line that indicates whether the inference optimization’s high up-front cost is worth the benefit of cheap sampling later.
• It seems plausible that we could trade off a small amount of inaccuracy in the inference for large speed improvements in inference.

Third problem: the biggest problem with this system is: in practice it only works for discrete distributions with small supports!

I started this series by making a general-purpose distribution type for discrete or continuous distributions, waved my hands a bit about continuous distributions, and then abandoned them entirely. But they are important!

Moreover, we can’t easily work with discrete distributions where any integer is a possible output, like the geometric distribution or the Poisson distribution. Sure, we can approximate them by restricting their ranges to something reasonable, but it would be nice to be able to just have an object that represents a Poisson distribution and not have to worry about the repercussions of the fact that technically, the support is any positive integer.

Summing up:

• I’ve spent the last 22 episodes of this series solving the easiest possible problems in this space. All the operations we care about on “categorical distributions” admit easy exact solutions to the inference problem.
• If you take away one thing from this series, well, it should be that `System.Random` is way too simple for the needs of modern programmers dealing with stochastic systems. By making a couple simple interfaces and implementations, we can represent very rich probabilistic workflows, compose them, and do math automatically.
• If you take away two things, the second thing should be that sampling from a given distribution efficiently is a really hard problem. By restricting the problem enormously, we can make good progress on it, but once we leave the realm of tiny discrete distributions, we will be in some pretty deep water.

We’ve taken small discrete distributions as far as I’m going to go in this series. I’m going to take a short break to visit my family for Easter, and then we’ll pick up when I get back. I’ll spend the remainder of this series sketching out some ways we can use the ideas we’ve developed so far to tackle harder problems in stochastic programming, but will not go into as much depth as we have so far.

Next time on FAIC: How can we use these techniques to model Markov processes?

# Fixing Random, part 22

[Code for this episode is here.]

Last time in this series I left you with several challenges for improving our DSL for imperative probabilistic workflows. But first, a puzzle:

Question One: You are walking down the street when you see a can of gasoline, some matches, a house on fire, a wrench, a fire hose and a fire hydrant. You consider the house being on fire to be problematic, so what do you do?

.

.

.

.

Of course you connect the hose to the fire hydrant with the wrench and use the hose to extinguish the house.

Question Two: You are walking down the street when you see a can of gasoline, some matches, a house not on fire, a wrench, a fire hose and a fire hydrant. What do you do?

.

.

.

Set the house on fire with the gasoline and the matches; we have now reduced the problem to a previously solved problem, so we’re done.

That’s how compiler writers approach problems. We know how to do lowerings in a very simplified version of probabilistic C#; to lower more complicated forms, we just keep on applying the “lower it to a previously solved problem” technique:

• `goto` is straightforward: you just evaluate the “statement method” that is the statement you’re going to.
• Once you have `goto` and `if`, it is straightforward to implement `do`, `while`, `break` and `continue`; just reduce those to combinations of `if` and `goto`, which are previously solved problems.
• `for` loops are just an irritatingly complicated version of `while`, which is a previously solved problem.
• Implementing `throw` and `try` is tricky because it is a non-local `goto`. I might come back in a later episode and describe ways of taming `throw` and `try` in these sorts of workflows. For some thoughts on why this is tricky, see my past episodes on problems with iterator blocks. Recall also that use of various kinds of `try` blocks was originally restricted for async methods, for similar reasons.
• We have not solved the problem for `try`, which means that we’re also blocked on `foreach`, `lock` and `using`, all of which are syntactic sugars for `try`.
• To allow the `sample` operator in more places, again, reduce the problem to previously solved problems. For example:

x = Foo(sample A(), sample B()[sample C()]) + sample D();

is equivalent to

var a = sample A();
var b = sample B();
var c = sample C();
var f = Foo(a, b[c]);
var d = sample D();
x = f + d;

And we already know how to lower all of those.

• You’ve got to be careful with the operators that conditionally evaluate their operands (`&&`, `||`, `??`, `?:`), but you can lower those into `if` statements.
• Assignments, increments and decrements used in expression locations can be handled with the same trick: turn the code into a form that has only statements that we already know how to do.
• For lambdas and anonymous methods, we’ve already stated up front that they need to be pure, which means that they should not be closed over values which change. Given that restriction, lambdas (and therefore also query expressions) can be allowed.
• Moreover, we know how to turn lambdas into ordinary methods; if we apply the same restrictions to a lambda body as we do to probabilistic methods then we could allow probabilistic lambdas; that is, lambdas that return probability distributions and contain `sample` operators and `condition` statements. This is much the same as the way C# allows async lambdas.

Aside: If actually implementing this scheme sounds super complicated with lots of details to keep track of and performance pitfalls everywhere, congratulations, you now know what it is like to be a compiler developer!

Compiler developers have to do all of this stuff even without fancy kinds of workflows in the language; everything that I am describing is a problem that has to be solved during the codegen phase of a compiler, because IL does not have concepts like `while` or `for` or “assignment as an expression”. It’s all about reducing programs to simpler and simpler forms, until you get it down to the IL.

Let’s assume for now that we can make all those basic enhancements to our imperative DSL, and address the question I left you with last time: is this sugar for queries just a sugar, or is there something we can do with it that we can’t do with a query?

Yes and no.

As we’ve seen, ultimately everything gets desugared into calls to `SelectMany`, so no. Anything you can do with this notation, you can do with a `SelectMany`, and therefore you could write a query comprehension. However, this notation certainly makes it much easier to clearly and concisely represent operations that would look pretty odd in a query.

For example, we don’t normally think of queries as being recursive, but recursive workflows are straightforward in this DSL, so maybe there is an argument that yes, there is some more expressiveness in the statement form.

How’s that, you say? Let’s look at an example of a recursive workflow.

Let’s play a game in `n` rounds; we’ll state up front what `n` is. You flip a coin. If it is heads, I pay you `n` dollars and we’re done. If it is tails, we go to the next round. In the next round, you flip a coin. If it is heads, I pay you `n-1` dollars, if tails, we go to the next round, and so on. If we get down to zero dollars, the game is over, and you get nothing. (The recursion has to stop somewhere!)

Aside: Obviously we do not have to implement this using recursion; we could use a loop. My point is that we easily can implement it with recursion, so recursion is a tool we have available in our toolbox when building probabilistic workflows.

Exercise: what would a fair price be for you to pay me up front so that the game is on average a net gain for neither you nor me? The answer is below.

I’m going to assume that we’ve made all the straightforward improvements to our DSL, including allowing probabilistic expression-bodied methods. We can implement the logic of our game like this:

probabilistic IDiscreteDistribution<int> Game(int rounds) =>
(rounds <= 0 || sample Flip()) ?
rounds :
sample Game(rounds - 1);

We can do an initial lowering to a simpler form:

probabilistic IDiscreteDistribution<int> Game(int rounds)
{
S0: if (rounds <= 0) goto S5;
S1: x = sample Flip();
S2: if (x) goto S5;
S3: y = sample Game(rounds - 1);
S4: return y;
S5: return rounds;
}

And then lower that form into actual runnable C# 7. (I’ll elide the unnecessary locals and temporaries.)

static IDiscreteDistribution<int> Game(int rounds)
{
IDiscreteDistribution<int> S0() =>
rounds <= 0 ? S5() : S1();
IDiscreteDistribution<int> S1() =>
Flip().SelectMany(x => S2(x));
IDiscreteDistribution<int> S2(bool x) =>
x ? S5() : S3();
IDiscreteDistribution<int> S3() =>
Game(rounds - 1).SelectMany(y => S4(y));
IDiscreteDistribution<int> S4(int y) =>
Singleton<int>.Distribution(y);
IDiscreteDistribution<int> S5() =>
Singleton<int>.Distribution(rounds);
return S0();
}

(Inlining that into a query is left as an exercise.)
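As a sanity check on the lowering, here is the same state machine transcribed into a short Python sketch. The dict-of-probabilities representation and the `singleton` and `select_many` helpers are my own stand-ins for the series’ C# types, not its actual API:

```python
from fractions import Fraction

# A distribution here is just a dict from outcome to exact probability.
def singleton(v):
    return {v: Fraction(1)}

def select_many(dist, f):
    # Monadic bind: weight each sub-distribution by the probability
    # of the outcome that produced it.
    result = {}
    for v, p in dist.items():
        for w, q in f(v).items():
            result[w] = result.get(w, Fraction(0)) + p * q
    return result

def flip():
    return {True: Fraction(1, 2), False: Fraction(1, 2)}

# Direct transcription of the lowered Game state machine above.
def game(rounds):
    def s0():
        return s5() if rounds <= 0 else s1()
    def s1():
        return select_many(flip(), s2)
    def s2(x):
        return s5() if x else s3()
    def s3():
        return select_many(game(rounds - 1), s4)
    def s4(y):
        return singleton(y)
    def s5():
        return singleton(rounds)
    return s0()
```

Normalizing the weights 16:8:4:2:1:1 over a total of 32 gives exactly the probabilities `game(5)` computes: 1/2, 1/4, 1/8, 1/16, 1/32, 1/32.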

I asked above what is the fair price to pay; that’s the expected value. We can work that out directly from the distribution:

public static double ExpectedValue(
this IDiscreteDistribution<int> d) =>
d.Support().Select(s =>
(double)s * d.Weight(s)).Sum() / d.TotalWeight();

And if we run it, we get the answers we seek:

Console.WriteLine(Game(5).ShowWeights());
Console.WriteLine(Game(5).ExpectedValue());

```
5:16
4:8
3:4
2:2
1:1
0:1
4.03125
```

You, who know probability theory, could have predicted the results: out of a large set of games, on average you get paid 5 half the time, 4 a quarter of the time, and so on.

If you paid me $4.00 for the chance to play this game in five rounds, on average you’d come out ahead by a few cents. Half the time you’d net a dollar, a quarter of the time you’d break even, and the rest of the time you’d lose some or all of your money.
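The expected value is easy to double-check with a few lines of Python; the weights below are copied from the C# output, and the helper is a stand-in of mine, not the series’ `ExpectedValue` extension:

```python
from fractions import Fraction

def expected_value(weights):
    # Weighted mean over the support; the weights need not be normalized.
    total = sum(weights.values())
    return sum(Fraction(v) * w for v, w in weights.items()) / total

# Weights printed by Game(5).ShowWeights() above.
game5 = {5: 16, 4: 8, 3: 4, 2: 2, 1: 1, 0: 1}

print(float(expected_value(game5)))  # 4.03125
```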

Again the thing that is so exciting about this technique is: we humans didn’t need to work out either the weights or the expected value; we wrote a program describing the game and executing that program produced the solution for the exact distribution of the results!

In a previous series I showed how re-implementing algorithms that use doubles to instead use dual numbers can make the runtime compute the exact value of the derivative of a function at a point while it is computing the value of the function at that point. We no longer have to manually do the work to compute a derivative, and nor do we have to approximate the derivative using numerical methods, and nor do we have to do symbolic differentiation; we just imperatively describe how the value is computed, and the runtime does a little extra work along the way to exactly compute the derivative.

What I’ve shown here is that we can do pretty much the same thing for small discrete probability distributions! We don’t have to work out what the distribution of this silly game is. Nor do we have to simulate the game, collect a hundred thousand data points, and make a statistical approximation of the distribution. We just describe the flow of the game in imperative code, run the method, and the thing that pops out the other end is the exact distribution!  It’s almost magical.

Next time on FAIC: It can’t really be that easy, can it?

# Porting old posts, part 2

I’m continuing my efforts to port over and update my old blog content. The previous episode is here.

We’re still in the first few weeks of me blogging; I was pumping out articles at a rate I now consider to be quite ridiculous, but it was how I thought I was going to get readership. (And I suppose it worked.)

Here we see the emergence of some common themes throughout this blog: security-through-design, the impact of design choices on collection types, and digging into the minutia of specifications.

The requirement that JavaScript have an `eval` really limits how you can design both the language proper and its runtime implementation. In this case though we had the opposite problem; the design of the language influenced the API design for the evaluator, when we decided to add the same functionality to VBScript.

The second article was mostly a waste of time and effort; this was the second time that the original designer of JavaScript (later the CEO of Mozilla, who stepped down after supporting anti-equality initiatives in California) told me I was wrong, wrong, wrong, though in this case I never understood his criticism; the spec language seems straightforward to me.

The design factors inherent in array/dictionary/lookup/whatever data structures are of fundamental importance to computer programming; here I look at two things that could not be more different but have the same name.

Hi, I’m Eric and I’ll be your software developer this evening

This rant expressed a theme I frequently come back to: take responsibility for your mistakes! We all make them, and we’ll do better as individuals and as an industry if we learn from each other. Speaking of mistakes:

These mistakes were absurdly unprofessional; I was very green and should have had more adult supervision. But I learned a lot from them. Most importantly: the same tools we build to make developers’ lives easier also make attackers’ lives easier, so be careful.

I’m a traveling man, don’t tie me down

Though obviously I do not rent DVDs anymore, this is one of those “everyday algorithms” that I still use for common tasks.

Security professionals use jargon that is not always accessible, and it’s important to get it all straight. Just yesterday I was in a meeting where someone used “safe” to mean “compliant with policy” rather than “unable to harm the user”, and I found it quite confusing.

Why is a simple mathematical operation so tricky to get right? This is one of those human factors in API design, where we’ve got to think about how people’s mental model is going to go wrong.

More to come!

# Fixing Random, part 21

Last time in this series I proposed a stripped-down DSL for probabilistic workflows. Today, let’s see how we could “lower” it to ordinary C# 7 code. I’ll assume of course that we have all of the types and extension methods that we’ve developed so far. (Code for this episode is here.)

The first thing I’m going to do is describe a possible but very problematic way to do it.

We know how C# lowers methods that have `yield return` statements: it generates a class that implements `IEnumerable`, rewrites the method body as the `MoveNext` method of that class, and makes the method return an instance of the class. We could do the same thing here.

Recall that last time I gave three examples, the longest of which was:

probabilistic static IDiscreteDistribution<string> Workflow(int z)
{
int ii = sample TwoDSix();
if (ii == 2)
return "two";
condition ii != z;
bool b = sample Flip();
return b ? "heads" : ii.ToString();
}

We could lower this to:

static IDiscreteDistribution<string> Workflow(int z) =>
new WorkflowDistribution(z);

sealed class WorkflowDistribution :
IDiscreteDistribution<string>
{
int z;
public WorkflowDistribution(int z)
{
this.z = z;
}
public string Sample()
{
start:
int ii = TwoDSix().Sample();
if (ii == 2)
return "two";
if (!(ii != z))
goto start;
bool b = Flip().Sample();
return b ? "heads" : ii.ToString();
}
public IEnumerable<string> Support() => ???
public int Weight(string t) => ???
}

We’d need to make sure that we did the right thing if `z` got modified during the workflow and the condition was not met of course, but that’s a small detail.

We could similarly lower the other two; I won’t labour the point. It’s a lot of boilerplate code.

The good news is: we get the right distribution out of this thing:

Console.WriteLine(Workflow(3).Histogram());

```
  4|***
  5|****
  6|*****
  7|*******
  8|******
  9|****
 10|***
 11|**
 12|*
two|**
```

The bad news is:

• We haven’t computed the support or the weights, as required.
• We are once again in a situation where a condition causes a potentially long-running loop to execute; we could do hundreds or thousands of iterations of going back to “start” for every successful sample if the condition was rarely met.

Basically, we’ve lost the great property that we had before: that we automatically get the right discrete distribution object inferred.

So let’s state unequivocally right now what our goal here is. The goal of this feature is to take a method body that describes a probabilistic workflow that includes conditions and produce a semantically equivalent distribution that does not actually contain any loop-causing condition, the same as we did for query-based workflows.

In our exploration of creating a posterior, we’ve seen that we can take a query that contains `Where` clauses and produce a projected weighted integer distribution that matches the desired distribution. Though there is some cost to producing that distribution object, once we’ve paid that cost there is no additional per-sample cost. We can do the same here, by leveraging the work we’ve already done.

The technique I’m going to use is as follows:

• Each statement in the method will be numbered starting from zero; statements nested inside `if` statements are, of course, numbered as well.
• Each statement will be replaced by a local method named `Sn`, for `n` the number of the statement.
• Each local method will take as its arguments all the locals of the method. (If there are any assignments to formal parameters then they will be considered locals for this purpose.)
• Each local method will return a probability distribution.

At this point I could give you the general rule for each kind of statement in my stripped down language, but let’s not bother. Let’s just lower the methods and it will become obvious what we’re doing.

probabilistic IDiscreteDistribution<bool> Flip()
{
int x = sample Bernoulli.Distribution(1, 1);
return x == 0;
}

we rewrite this as:

IDiscreteDistribution<bool> Flip()
{
IDiscreteDistribution<bool> S0(int x) =>
Bernoulli.Distribution(1, 1)
.SelectMany(_x => S1(_x));
IDiscreteDistribution<bool> S1(int x) =>
Singleton<bool>.Distribution(x == 0);
return S0(0);
}

Read that over carefully and convince yourself that this implements the correct logic, just in a much more convoluted fashion.

Aside: Notice that we are taking advantage of the monad laws here regarding the relationship between binding and the function that produces a `Singleton`. It might not have been clear before why this is an important monad law; hopefully it is now clear.

Aside: Recall that in the jargon of monads, the `unit` operation that creates a new wrapper around a single instance of the underlying type is sometimes called `return`. It is no coincidence that our lowering turns `return` statements into singletons!

Now let’s lower the next one:

probabilistic IDiscreteDistribution<int> TwoDSix()
{
var d = StandardDiscreteUniform.Distribution(1, 6);
int x = sample d;
int y = sample d;
return x + y;
}

This becomes this mess:

static IDiscreteDistribution<int> TwoDSix()
{
IDiscreteDistribution<int> S0(
IDiscreteDistribution<int> d, int x, int y) =>
S1(StandardDiscreteUniform.Distribution(1, 6), x, y);
IDiscreteDistribution<int> S1(
IDiscreteDistribution<int> d, int x, int y) =>
d.SelectMany(_x => S2(d, _x, y));
IDiscreteDistribution<int> S2(
IDiscreteDistribution<int> d, int x, int y) =>
d.SelectMany(_y => S3(d, x, _y));
IDiscreteDistribution<int> S3(
IDiscreteDistribution<int> d, int x, int y) =>
Singleton<int>.Distribution(x + y);
return S0(null, 0, 0);
}

Again, walk through that and convince yourself that it’s doing the right thing. This implementation is ridiculously overcomplicated for what it does, but you should have confidence that it is doing the right thing. Remember, we’re trying to prove that the proposed language feature is in theory possible first, and then we’ll think about ways to make it better.
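For readers who want to walk through it executably, here is the same lowering transcribed into Python. The dict representation and helpers are my own stand-ins for the series’ types; note how every state method takes all the locals, just as in the C# above:

```python
from fractions import Fraction

def singleton(v):
    return {v: Fraction(1)}

def select_many(dist, f):
    # Monadic bind over dict-of-probabilities distributions.
    result = {}
    for v, p in dist.items():
        for w, q in f(v).items():
            result[w] = result.get(w, Fraction(0)) + p * q
    return result

def uniform(lo, hi):
    n = hi - lo + 1
    return {v: Fraction(1, n) for v in range(lo, hi + 1)}

# Direct transcription of the lowered TwoDSix: every state takes all
# the locals (d, x, y), even the ones it never uses.
def two_d_six():
    def s0(d, x, y):
        return s1(uniform(1, 6), x, y)
    def s1(d, x, y):
        return select_many(d, lambda _x: s2(d, _x, y))
    def s2(d, x, y):
        return select_many(d, lambda _y: s3(d, x, _y))
    def s3(d, x, y):
        return singleton(x + y)
    return s0(None, 0, 0)

print(two_d_six()[7])  # 1/6
```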

Finally, let’s do one with an `if` and a `condition`:

probabilistic IDiscreteDistribution<string> Workflow(int z)
{
int ii = sample TwoDSix();
if (ii == 2)
return "two";
condition ii != z;
bool b = sample Flip();
return b ? "heads" : ii.ToString();
}

This lowers to:

static IDiscreteDistribution<string> Workflow(int z)
{
IDiscreteDistribution<string> S0(int ii, bool b) =>
TwoDSix().SelectMany(_ii => S1(_ii, b));
IDiscreteDistribution<string> S1(int ii, bool b) =>
ii == 2 ? S2(ii, b) : S3(ii, b);
IDiscreteDistribution<string> S2(int ii, bool b) =>
Singleton<string>.Distribution("two");
IDiscreteDistribution<string> S3(int ii, bool b) =>
ii != z ? S4(ii, b) : Empty<string>.Distribution;
IDiscreteDistribution<string> S4(int ii, bool b) =>
Flip().SelectMany(_b => S5(ii, _b));
IDiscreteDistribution<string> S5(int ii, bool b) =>
Singleton<string>.Distribution(b ? "heads" : ii.ToString());
return S0(0, false);
}

And at last we can find out what the distribution of this workflow is:

Console.WriteLine(Workflow(3).ShowWeights());

gives us:

```
two:2
4:3
5:4
6:5
7:6
8:5
9:4
10:3
11:2
12:1
heads:33
```

We can mechanically rewrite our little probabilistic workflow language into a program that automatically computes the desired probability distribution.

It should now be pretty clear what the rewrite rules are for our simple DSL:

• Normal assignment and declaration statements invoke the “next statement” method with new values for the locals.
• Assignment or declaration statements with `sample` become `SelectMany`.
• The `condition` statement evaluates its condition and then either continues to the next statement if it is met, or produces an empty distribution if it is not.
• `if` statements evaluate their condition and then invoke the consequence or the alternative helper method.
• `return` statements evaluate an expression and return a `Singleton` of it.
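Those rules can be checked mechanically. Here is a hedged Python transcription of the lowered `Workflow` (the dict-of-probabilities representation and helpers are my stand-ins for the series’ types); a failed `condition` becomes the empty distribution, so the resulting probabilities are deliberately unnormalized:

```python
from fractions import Fraction

EMPTY = {}

def singleton(v):
    return {v: Fraction(1)}

def select_many(dist, f):
    # Monadic bind over dict-of-probabilities distributions.
    result = {}
    for v, p in dist.items():
        for w, q in f(v).items():
            result[w] = result.get(w, Fraction(0)) + p * q
    return result

def flip():
    return {True: Fraction(1, 2), False: Fraction(1, 2)}

def two_d_six():
    d = {v: Fraction(1, 6) for v in range(1, 7)}
    return select_many(d, lambda x: select_many(d, lambda y: singleton(x + y)))

# Direct transcription of the lowered Workflow state machine.
def workflow(z):
    def s0(ii, b):
        return select_many(two_d_six(), lambda _ii: s1(_ii, b))
    def s1(ii, b):
        return s2(ii, b) if ii == 2 else s3(ii, b)
    def s2(ii, b):
        return singleton("two")
    def s3(ii, b):
        return s4(ii, b) if ii != z else EMPTY
    def s4(ii, b):
        return select_many(flip(), lambda _b: s5(ii, _b))
    def s5(ii, b):
        return singleton("heads" if b else str(ii))
    return s0(0, False)

# Multiplying each probability through by 72 recovers integer weights:
# "two" gets 2, "7" gets 6, and so on.
```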

Exercise: I severely restricted the kinds of statements and expressions that could appear in this language. Can you see how to implement:

• `while`, `do`, `for`, `break`, `continue`, `switch` and `goto`/labeled statements
• compound assignment, increment and decrement as statements

Exercise: What about assignments, increments and decrements as expressions; any issues there?

Exercise: What about lambdas? Any issues with adding them in? Again, we’re assuming that all functions called in this system are pure, and that includes lambdas.

Exercise: I severely restricted where the `sample` operator may appear, just to make it easier on me. Can you see how we might ease that restriction? For example, it would be much nicer to write `TwoDSix` as:

probabilistic IDiscreteDistribution<int> TwoDSix()
{
var d = StandardDiscreteUniform.Distribution(1, 6);
return sample d + sample d;
}

How might you modify the lowering regimen I’ve described to handle that?

We’ll discuss possible answers to all of these in future episodes.

As I’ve noted repeatedly in this series: of course this is not how we’d actually implement the proposed feature. Problems abound:

• I’ve absurdly over-complicated the code generation to the point where every statement is its own local function.
• We are passing around all the locals as parameters; probably some of them could be realized as actual locals.
• Each method is tail recursive, but C# does not guarantee that tail recursions are optimized away, so the stack depth could get quite large.
• Several of the methods take arguments that they do not ever use.
• We could have made some simple inlining optimizations that would decrease the number of helper methods considerably.

That last point bears further exploration. Imagine we applied an inlining optimization over and over again — that is, the optimization where we replace a function call with the body of the function. Since every function call here is expression-bodied, that’s pretty easy to do!

Exercise: Do so; the answer follows. (Note that I asked how to write this workflow as a query last time; did you get the same answer?)

.

.

.

.

What would come out the other end of repeated inlining is:

static IDiscreteDistribution<string> Workflow(int z) =>
TwoDSix().SelectMany(ii =>
ii == 2 ?
Singleton<string>.Distribution("two") :
ii != z ?
Flip().SelectMany(b =>
Singleton<string>.Distribution(b ?
"heads" :
ii.ToString())) :
Empty<string>.Distribution);

Which is hardly any easier to read, but at least it is not quite so many functions.

Is this then how we could implement this feature for real? Lower it to functions, then inline the functions? Though it would work, that’s probably not how I’d do it for real. I might get to discuss a more realistic lowering technique in a future episode, but we’ll see how it goes. This series is getting pretty long and we’re not done yet!

And again, my purpose here was to show that it could be done with a straightforward, rules-based transformation; I’m not looking to find the optimal solution. It can be done, which is great!

Aside: I said a few episodes back that it would become clear why having an explicitly empty distribution is a win; hopefully it is now clear. Having an explicitly empty distribution allows us to implement `condition` using `SelectMany`! It is not at all obvious how to rewrite the workflow above into a query with a `Where` clause, but as we’ve seen, implementing the `condition` statement as conditionally producing an empty distribution gives us the ability to stick what is logically a `Where` anywhere in the workflow.

Aside: Many episodes back I gave the challenge of implementing a `Where` that took a probabilistic predicate: `Func<T, IDiscreteDistribution<bool>>`.  That is no challenge at all in our new workflow language:

bool b = sample predicate(whatever);
condition b;

I sincerely hope that you agree that my proposed “imperative style” workflow is an improvement over our use of LINQ to project, condition and combine probability distributions; our `Workflow()` method is a lot easier to read as an imperative workflow than its query form, and not just because it uses the jargon of probability rather than that of sequences.

Next time on FAIC: It looks like we’ve come up with a reasonable syntactic sugar, but is it just a sugar? Is there anything that our “imperative” workflow can do that our “LINQ” workflow cannot?

# Porting old posts, part 1

Thanks again to the good people at Microsoft who have kept my old blog alive for now; my plan is to port the articles from the old site over, and then they will redirect from the old URLs to the new ones.

I’ve started the long process of porting old articles and it has been fun revisiting topics I haven’t thought about much for years.

I started this blog almost two years after I stopped working on scripting at Microsoft, but for the first several years, that’s almost all I posted about. The purpose of my blog was to get onto the web all the technical and design decisions we’d made about VBScript and JScript, and a lot of the details of my day job on Visual Studio Tools for Office did not translate well into the blog format.

IE is of course long gone, and now that Microsoft has switched to a Chromium-based browser for Edge, all of this stuff is now “for historical purposes only”; surely there are no developers left who need this information. Though that does make me slightly sad, I have to remember that software is plastic and we’re in the business of doing better than our past selves; sometimes that means starting afresh.

Anyways, as I port articles over I’ll post links to them here, with a few reflections.

My first posts, among the most popular of all time, and surprisingly still relevant.

Getting numerous links over the years from Raymond and others was a big help in kick-starting this blog. It’s also interesting in retrospect to realize that the standard format for allocating and deallocating strings in COM came from Visual Basic of all places; the B stands for “Basic”.

The VT_DATE format is horrid and should be destroyed by fire, but I think we’re still stuck with it.

The naming convention that everyone loves to hate is actually very useful if you use it correctly.

Why does JavaScript have rounding errors?

A classic FAQ that I got all the time when supporting the script engines.

As I update these articles, I’m going to try to use “JavaScript” to refer to the language proper, and “JScript” when I am speaking specifically about the Microsoft implementation from the mid 1990s.

What does this error mean? It means that you cannot use parentheses when calling a sub! But boy, what a poor choice of syntax, that created so many questions.

The reason for the error is an oddity in the way that VBScript distinguishes between passing by value and passing by reference.

Understanding the difference between value semantics and reference semantics as they pertain to values, references and variables is a constant source of user questions. I wish we had done a better job early on of using “reference” to mean a reference to an object, and “alias” to mean a reference to a variable. But, we didn’t, and now we’re stuck with lots of beginner confusion.

That’s why this was a frequent topic in the early days of my blog, and came back again in the C# years.

This also marks the first of many quizzes and puzzles, to be answered in future episodes.

Smart pointers are too smart

The first really “controversial” post where I took a stand and wrote a rant. I stand by my opinion.

Functions without names, obviously. This is my first foray into writing about functional programming.

These were the first posts where I really got into the implementation choices of a programming language.

This was also the first of many times I was told I was WRONG WRONG WRONG by Brendan Eich, the designer of JavaScript, later to be the CEO of Mozilla who stepped down after supporting anti-equality initiatives in California.

The script engines use a wacky threading model, but it was built like that for reasons. However, if you use it wrong in ASP, you’ll quickly wreck the performance of your server as it tries to follow all the rules.

And if the developer of the components makes a boneheaded mistake, as I surely did, he learns all this stuff in a hurry.

Hard core denotational semantics

I discuss the highfalutin specification techniques used by the maintainers of the ECMAScript 4 specification; unfortunately, that version of the spec famously never shipped, and ECMAScript 5 went back to operational semantics.

# Fixing Random, part 20

Without further ado, here’s my proposed stripped-down C# that could be a DSL for probabilistic workflows; as we’ll see, it is quite similar to both enumerator blocks from C# 2 and async/await from C# 5. (Code for this episode can be found here.)

• The workflow must be a function with return type of `IDiscreteDistribution<T>` for some `T`.
• Just as methods that use `await` must be marked `async`, our DSL methods must be marked `probabilistic`.
• The function has an ordinary function body: a block of statements.

Obviously I am using C# 5 asynchronous workflows as an example of how to add a new mechanism to C#. The statements and the expressions in a probabilistic method are restricted as follows:

• Declaration statements must have an initializer.
• All the locals must have unique names throughout the method.
• Assignments can only be in declarations or as statements; no use of assignments in the middle of an expression is allowed.
• No compound assignments, increments or decrements.
• No `await`, `yield`, `try`, `throw`, `switch`, `while`, `do`, `break`, `continue`, `goto`, `fixed`, `lock`, `using` or labeled statements allowed. What’s left? `if` and `return` statements are fine, as are blocks `{ }`. I told you I was stripping things down!
• No lambdas, anonymous functions or query comprehensions.
• No use of `in`, `out` or `ref`.
• All the other perfectly normal operations are allowed — function calls, object initializers, member access, indexers, arithmetic, that’s all just fine.
• However, all function calls and whatnot must be pure; there can be no side effects, and nothing should throw.

Basically what I’m doing here is making a little super-simple subset of C# that doesn’t have any of the weird, hard stuff that keep compiler writers busy. In this pleasant world locals are always assigned, there are no closures, there are no worries about what happens when an exception is thrown, and all that sort of stuff.

This is pretty draconian, I know. We’ll sketch out how to soften some of those restrictions in later episodes. I want to show that we can describe how to implement a simple core of a language; harder features can come later.

To this little language I’m going to add a new unary operator and a new statement. The new operator is `sample`, and it may appear only as the right operand of an assignment, or the initializer of a declaration:

int x = sample WeightedInteger.Distribution(8, 2, 7, 3);

The operand must be of type `IDiscreteDistribution<T>`, and the type of the expression is `T`. Again, this should feel familiar if you’re used to `await`.

The new statement is `condition`, and it has the form

```
condition expression;
```

The expression must be convertible to `bool`. The meaning of this thing is much the same as `Where`: if the Boolean condition is not met, then a value is filtered out from the distribution by setting its weight to zero. But I’m getting ahead of myself.
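To make that `Where`-like semantics concrete, here is a sketch in Python rather than the series’ C# types; the dictionary-of-weights representation of a discrete distribution is my own simplification:

```python
from fractions import Fraction

def condition(dist, predicate):
    # Drop (i.e. zero out) the weight of every value that fails the
    # predicate, then renormalize what remains; this is the same
    # operation as Where on a discrete distribution.
    filtered = {v: w for v, w in dist.items() if predicate(v)}
    total = sum(filtered.values())
    return {v: Fraction(w, total) for v, w in filtered.items()}

# WeightedInteger.Distribution(8, 2, 7, 3): values 0..3 with those weights.
d = {0: 8, 1: 2, 2: 7, 3: 3}
posterior = condition(d, lambda v: v != 0)
# posterior: {1: 1/6, 2: 7/12, 3: 1/4}
```

Note that the surviving weights 2, 7 and 3 are simply renormalized over their new total of 12.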

Aside: Last time I talked a bit about how language designers must choose how general or specific a language element is, and that this must be reflected in syntax; this is a great example of such a choice.

I’ve chosen `condition` because I think of this operation as creating a conditional distribution; I could have said `where` instead of `condition` and had it mean basically the same thing, but that would be moving towards the established C# jargon for sequences, which I’m explicitly trying to avoid.

However, as we’ve seen, a primary use case for a conditional distribution is computing the posterior distribution given a prior and a likelihood. But why do we wish to compute the posterior at all? Typically because we have observed something. That is, the development cycle of the probabilistic program is:

• Before the program is written, data scientists and developers compute priors, like “What percentage of emails are spam?”
• They also compute likelihood functions: “what word combinations are likely, given that an email is spam? What word combinations are likely given that an email is not spam?”
• A spam detection program must now answer the question: given that we have observed an email with certain word combinations, what is the posterior probability that it is spam? This is where the probabilistic workflow comes in.

For that reason, we could reasonably choose `observation` or `observe` as our keyword here, instead of `condition`, emphasizing to the developer “the reason you use this feature is to express the computation of a posterior distribution for a given observation”. That would make the feature feel a little bit less general, but might help more clearly express the desired semantics in the mind of the typical programmer designing a workflow that computes a posterior.

I’m going to stick with `condition`, but I just wanted to point out that this is a choice that has user experience consequences, were we actually to implement this feature in a language like C#.

Let’s look at some examples:

```
probabilistic IDiscreteDistribution<bool> Flip()
{
    int x = sample Bernoulli.Distribution(1, 1);
    return x == 0;
}
```

What I would like is for this method to have the exact same semantics as if I had written:

```
IDiscreteDistribution<bool> Flip() =>
    Bernoulli.Distribution(1, 1).Select(x => x == 0);
```
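The `Select` semantics used here can be sketched in Python: push every value through the projection and merge the weights of any values that collide. Again, the representation is my own simplification, not the series’ types:

```python
from collections import defaultdict

def select(dist, projection):
    # Project every value, summing the weights of values that
    # project to the same result.
    out = defaultdict(int)
    for value, weight in dist.items():
        out[projection(value)] += weight
    return dict(out)

bernoulli = {0: 1, 1: 1}                  # Bernoulli.Distribution(1, 1)
flip = select(bernoulli, lambda x: x == 0)
# flip: {True: 1, False: 1}

# Weights merge when projections collide:
die = {v: 1 for v in range(1, 7)}
parity = select(die, lambda x: x % 2)     # {1: 3, 0: 3}
```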

What do you think about these two implementations? I think the former looks a lot more like the straightforward, imperative code that we’re used to.

Let’s look at another:

```
probabilistic IDiscreteDistribution<int> TwoDSix()
{
    var d = SDU.Distribution(1, 6);
    int x = sample d;
    int y = sample d;
    return x + y;
}
```
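The distribution this workflow should infer can be checked by brute-force enumeration; a Python sketch, assuming `SDU.Distribution(1, 6)` means the standard discrete uniform distribution of one fair die:

```python
from collections import Counter
from itertools import product

# Enumerate every (x, y) pair of two fair six-sided dice and count
# how many of the 36 equally likely pairs produce each sum; the
# counts are the weight function of TwoDSix.
weights = Counter(x + y for x, y in product(range(1, 7), repeat=2))
# weights[2] == 1, weights[7] == 6, weights[12] == 1
```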

Again, it seems to me that this is more clear than

```
IDiscreteDistribution<int> TwoDSix()
{
    var d = SDU.Distribution(1, 6);
    return from x in d from y in d select x + y;
}
```

And it certainly seems more clear, particularly to the novice, than

```
IDiscreteDistribution<int> TwoDSix()
{
    var d = SDU.Distribution(1, 6);
    return d.SelectMany(x => d, (x, y) => x + y);
}
```

LINQ is awesome for sequences, but I think the statement-based workflow is much easier to understand for distributions.

Now let’s look at a really complicated workflow, this one with conditioning:

```
probabilistic IDiscreteDistribution<string> Workflow(int z)
{
    int ii = sample TwoDSix();
    if (ii == 2)
        return "two";
    condition ii != z;
    bool b = sample Flip();
    return b ? "heads" : ii.ToString();
}
```

For the first two workflows it was easy to see how they corresponded to query syntax, but what even is the distribution represented by this one? It depends on the value of the parameter `z`, for one thing.

What I want here is: when you call this method, you get a distribution back. When you `Sample()` that distribution, it should logically have the same effect as: all the `sample` operations `Sample()` their operand, and the returned value is, well, the returned value. However, if any `condition` is false then we abandon the current attempt and start over.

The trick is implementing those semantics without actually running the body in a loop!
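To pin down what “abandon the current attempt and start over” means, here is a deliberately naive Python sketch that does run the body in a loop, i.e. rejection sampling. It has the intended observable behavior, just not the efficient implementation we are after; the fair-dice and fair-coin assumptions match the workflows above:

```python
import random

def workflow(z):
    # One logical sample of Workflow(z): each `sample` draws from its
    # operand, and a failed `condition` abandons the whole attempt.
    while True:
        ii = random.randint(1, 6) + random.randint(1, 6)  # sample TwoDSix()
        if ii == 2:
            return "two"
        if not (ii != z):        # condition ii != z
            continue             # rejected; restart from the top
        b = random.randint(0, 1) == 0                     # sample Flip()
        return "heads" if b else str(ii)

random.seed(1)
samples = [workflow(3) for _ in range(10_000)]
# "3" can never be produced; "two", "heads" and "4".."12" all can.
```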

Exercise: Pick a value for `z`; let’s say 3. See if you can work out what the distribution of strings is that should come out the other end of this thing. I’ll give the answer in the next episode.

Exercise: If you had to represent this workflow as a query, how would you do it?

Next time on FAIC: Suppose we were writing a compiler for this feature. We know that LINQ works by turning query comprehensions into function calls; we know that the compiler turns async workflows and iterator blocks into state-machine coroutines. How might we lower this new kind of code into ordinary C# 7 code? Probably somehow using our existing query operators… but what exactly would work?

# Fixing Random, part 19

I’ve got no code for you this time; instead here are some musings about language design informed by our discussion so far.

One of the most important questions to ask when designing a language feature is: what should the balance be between generality and specificity of a feature?

My go-to example is LINQ, and what we’ve been talking about for the last dozen episodes illustrates the problem nicely.

As we’ve seen many times in this blog, and particularly in this series, query comprehensions allow us to concisely express projection (`Select`), binding (`SelectMany`) and use of the zero of an additive monad (`Where`) to express a condition. Because C# uses a syntactic transformation to implement LINQ, we can express these concepts on an instance of any monad that has the right methods. The compiler does not know or care whether the thing being selected from is a sequence, a probability distribution, whatever, doesn’t matter. All that matters is that a method with the right signature is available as an instance or extension method.

The language designers for C# 3.0 were therefore faced with a choice: how general should the LINQ feature be? There’s a spectrum of possibilities; a few points on that spectrum are:

• We all know that a monad is just a monoid in the category of endofunctors, which implies that there are other patterns like “monoid” and “category” and “functor” that we could be describing in the type system; what’s so special about monads? We could create a whole “higher-kinded” type system to describe generic type patterns; “monad” is just one possible pattern.
• We could embed monads as a special kind of thing right into the language, using the vocabulary of monads: `bind`, `unit` and so on. In this language, sequences are just a special case of the more general monad feature.
• The feature could be made specifically about operations on sequences: not just `Select`, `SelectMany` and `Where`, which make sense on multiple monads, but also operations that do not apply across monads, like `GroupBy`, `Join` and `OrderBy`. The operations could be applicable to any data source that supports them, whether XML documents or database tables or plain old lists, but the concepts in the API would be tailored to data in the form of a sequence of things; whatever the source, it has to be a sequence.
• The feature could be narrowly tailored to the needs of a particular data technology; LINQ could have been simply allowing SQL Server query syntax to be embedded directly in the language, and it could have been made to work only with SQL Server and no other dialects of SQL or other database back ends.

The designers of C# chose a point on the spectrum slightly more general than the third point: the feature is not written in the jargon of monads, but it is more general than simple operations on sequences. The designers of Haskell chose maximum generality, and so on.

These are of course not the only options; there were any number of points on that spectrum where the language design could have fallen, and different language designers have chosen different points. Think about list comprehensions in Python, for instance; they are obviously similar to LINQ in many respects, but you’re not going to be marshaling a list comprehension to a database or using it as a generalized monadic bind operation any time soon. There’s nothing wrong with that; it was a reasonable choice for Python.

Our exploration of representing distributions as a monad and combining them using LINQ operators and comprehensions illustrates the pros and cons of C#’s choice. Though I think it is delightful that we can use `Select`, `SelectMany` and `Where` on probability distributions as though they were infinite sequences, and pleased that we can optimize the results using the laws of probability for discrete distributions, the language designer part of me feels like this is somehow wrong. We’re using the vocabulary of sequences, not the vocabulary of probability, and it feels out of place.

This in particular was brought into focus for me in my April 1st bonus episode. Though it was in parts deliberately obtuse and misleading, there was no real joke there; rather, I wished to illustrate a serious point with a silly example. Consider these two queries:

```
from car in doors
from you in doors
from monty in (
    from m in doors
    where m != car
    where m != you
    select m)
select (car, you, monty)
```

and

```
from car in doors
from you in doors
from monty in doors
where monty != car
where monty != you
select (car, you, monty)
```

Plainly we can refactor the first into the second if `doors` is a sequence; this program transformation preserves semantics for sequences. But if `doors` is a distribution then the refactoring preserves the support but changes the weights.
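That claim can be verified numerically; a Python sketch, under the assumptions that `doors` is the uniform distribution on three doors and that the `where` semantics renormalizes the filtered distribution:

```python
from fractions import Fraction
from itertools import product

doors = [1, 2, 3]  # uniform: weight 1 each, total probability 1/3 apiece

# First query: monty is drawn from a conditioned *inner* distribution,
# renormalized separately for every (car, you) pair.
first = {}
for car, you in product(doors, doors):
    inner = [m for m in doors if m != car and m != you]
    for monty in inner:
        first[(car, you, monty)] = Fraction(1, 9) * Fraction(1, len(inner))

# Second query: the condition filters the *joint* distribution,
# which is renormalized once at the end.
allowed = [(c, y, m) for c, y, m in product(doors, doors, doors)
           if m != c and m != y]
second = {t: Fraction(1, len(allowed)) for t in allowed}

# Same support, different weights: e.g. (1, 1, 2) has probability 1/18
# in the first query but 1/12 in the second.
```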

This is some evidence that query comprehensions are perhaps not the best way to express operations on distributions: our intuitions about which refactorings are correct are heavily influenced by our understanding of sequence semantics, and this could lead to bugs.

Moreover, think about how the design of C# has evolved with respect to monadic types:

• C# 2 added nullable types and embedded nullable arithmetic and conversions in the language, but did not provide a general mechanism for lifting function calls and other operations to nullables until C# 6 added `?.`.
• C# 2 added statement-based sequence workflows in the form of `yield return`.
• C# 3 added query comprehensions for composing operations on sequences in an expression context.
• Reactive Extensions (Rx) leveraged LINQ to support observables — push-based sequences — but did not add any new syntax; it could do this because the duality between pull- and push-based sequences is strong and our intuitions largely still give correct outcomes.
• C# 5 added statement-based asynchronous workflows; though you can make async lambdas and use the `await` operator in the middle of an expression, asynchronous workflows are fundamentally collections of statements, not fluent compositions of expressions like query comprehensions.

You’ll notice that there is no overarching philosophy here; rather, as the needs of real-world developers evolve, the language evolves to represent different concepts at different levels of generality.

Async workflows could have been made to look just like Rx query expressions; after all, a task is logically an observable that pushes a single value instead of multiple values. But the designers of C# figured that it would be more natural for developers to mix asynchrony into their statement-based workflows. Similarly, nullables could be treated as sequences that contain zero or one value, but they’re not.

The lesson of C# here is: when you’re trying to represent a new concept, try creating a domain-specific subset of the language that solves the problem using the vocabulary of the problem, but with enough generality to be extensible in the future.

Could we do the same for stochastic workflows? Using LINQ as our representation has been cute, fun and educational, but as language designers we can recognize that real-world users would likely find this choice confusing and weird; distributions are like sequences in many ways, but they are also dissimilar in a lot of ways.

Can we come up with a domain-specific language subset that better matches the domain of probabilistic programming, but preserves (or improves upon) the nice properties that we’ve seen so far? We’ve seen that we can express a discrete-distribution workflow as a comprehension and automatically get an inferred distribution out the other end; could we do the same thing in a statement-based workflow language, akin to async/await?

Next time on FAIC: I’m going to sketch out a super-stripped down version of C# and add two new statements for probabilistic workflows. We’ll then show how that stripped-down version could be lowered to ordinary C# 7.