# Fixing Random, part 32

Last time on FAIC we reviewed the meaning of “expected value”: when you get a whole bunch of samples from a distribution, and a function on those samples, what is the average value of the function’s value as the number of samples gets large?

I gave a naive implementation:

public static double ExpectedValue<T>(
this IDistribution<T> d,
Func<T, double> f) =>
d.Samples().Take(1000).Select(f).Average();

public static double ExpectedValue(
this IDistribution<double> d) =>
d.ExpectedValue(x=>x);

Though short and sweet, this implementation has some problems; the most obvious one is that hard-coded 1000 in there; where did it come from? Nowhere in particular, that’s where.

It seems highly likely that this is either too big, and we can compute a good estimate in fewer samples, or that it is too small, and we’re missing some important values.

Let’s explore a scenario where the number of samples is too small. The scenario will be contrived, but not entirely unrealistic.

Let’s suppose we have an investment strategy; we invest a certain amount of money with this strategy, and when the strategy is complete, we have either more or less money than we started with. To simplify things, let’s say that the “outcome” of this strategy is just a number between 0.0 and 1.0; 0.0 indicates that the strategy has completely failed, resulting in a loss, and 1.0 indicates that the strategy has completely succeeded, resulting in the maximum possible return.

Before we go on, I want to talk a bit about that “resulting in a loss” part. If you’re a normal, sensible investor like me and you have \$100 to invest, you buy a stock or a mutual fund for \$100 because you believe it will increase in value. If it goes up to \$110 you sell it and pocket the \$10 profit. If it goes down to \$90, you sell it and take the \$10 loss. But in no case do you ever lose more than the \$100 you invested. (Though of course you do pay fees on each trade whether it goes up or down; let’s suppose those are negligible.) Our goal is to get that 10% return on investment.

Now consider the following much riskier strategy for spending \$100 to speculate in the market: suppose there is a stock currently at \$100 which we believe will go down, not up. We borrow a hundred shares from someone willing to lend them to us, and sell them for \$10000. We pay the lender \$100 interest for their trouble. Now if the stock goes down to \$90, we buy the hundred shares back for \$9000, return them to the owner, and we’ve spent \$100 but received \$1000; we’ve made a 900% return on the \$100 we “invested”. This is a “short sale”, and as you can see, you get a 900% return instead of a 10% return.

(I say “invested” in scare quotes because this isn’t really investing; it’s speculation. Which is a genteel word for “gambling”.)

But perhaps you also see the danger here. Suppose the stock goes down but only to \$99.50. We buy back the shares for \$9950, and we’ve only gained \$50 on the trade, so we’ve lost half of our \$100; in the “long” strategy, we would only have lost fifty cents; your losses can easily be bigger with the short strategy than with the long strategy.

But worse, what if the stock goes up to \$110. We have to buy back those shares for \$11000, so we’ve “invested” \$100 and gotten a “return” of -\$1000, for a total loss of \$1100 on a \$100 investment. In a short sale if things go catastrophically wrong you can end up losing more than your original investment. A lot more!

As I often say, foreshadowing is the sign of a quality blog; let’s continue with our example.

We have a process that produces values from 0.0 to 1.0 that indicates the “success level” of the strategy. What is the distribution of success level? Let’s suppose it’s a straightforward bell-shaped curve with the mean on the “success” side:

This is a PDF, and extra bonus, it is normalized: the total area under the curve is 1.0. The vast majority of the time — 99.9% — our investment strategy produces an outcome between 0.48 and 1.0, so that’s the area of the graph I’m going to focus on.

Aside: I know it looks a little weird seeing a bell-shaped curve that just cuts off abruptly at 1.0, but let’s not stress about it. Again, this is a contrived example, and we’ll see why it is irrelevant later.

Now, I don’t intend to imply that the “success level” is the return: I’m not saying that most of the time we get a 75% return. Let’s suppose that we’ve worked out a function that translates “strategy outcome level” to the percentage gain on our investment. That is, if the function is 0.0, we’ve not gained anything but we’ve not lost anything either. If it is 0.5 then our profits are 50% of our investment; if it’s -0.25 then we’ve taken a net loss of 25% of our investment, and so on.

Let’s propose such a function, and zoom in on the right side of the diagram where the 99.9% of cases are found: from 0.46 to 1.0. Here’s our function:

If our process produces an outcome of 0.54 or larger, we make between a 0% and an 18% return, which seems pretty good; after all the vast majority of the bulk of the distribution is to the right of 0.54. Now, in the unlikely case where we get an outcome from 0.48 to 0.54, we lose money; on the left hand end, we’re losing almost 200%; that is, if we put in \$100 we not only fail to make it back, we end up owing \$100.

The question at hand, of course, is what is the average value of the “profit function” given the distribution of outcomes? That is, if we ran this strategy a thousand times, on average what return would we get?

Well, we already wrote the code, so let’s run it: (Code can be found here.)

var p = Normal.Distribution(0.75, 0.09);
double f(double x) =>
Atan(1000 * (x  .45)) * 20  31.2;
Console.WriteLine(\$”{p.ExpectedValue(f):0.##});

Aside: I’m ignoring the fact that if we get a sample out of the normal distribution that is greater than 1.0 or less than 0.0 we do not discard it. The region to the left of 0.0 is absurdly unlikely, and the region to the right of 1.0 has low probability and the value of the profit function is very flat. Writing a lot of fiddly code to correct for this problem would not change the estimated expected value much and would be a distraction from the far deeper problem we’re about to discover.

Also, I’m limiting the distribution to a support of 0.0 to 1.0 for pedagogical reasons; as we’ll see in later episodes, we will eventually remove this restriction, so let’s just ignore it now.

We get the answer:

`0.14`

Super, we will get on average a 14% return on our investment with this strategy. Seems like a good investment; though we have a small chance of losing big, the much larger chance of getting a solid 0% to 18% return outweighs it. On average.

Aside: It is important to realize that this 14% figure does not necessarily imply that typically we get a 14% return when we run this strategy, any more than the expected value of 3.5 implies that we’ll typically get 3.5 when we roll a d6. The expected value only tells you what the average behaviour is; it doesn’t tell you, for example, what the probability is that you’ll lose everything. It tells you that if you ran this strategy a million times, on average you’d get a 14% return, and that’s all it tells you. Be careful when making decisions relying upon expected values!

This 14% expected result totally makes sense, right? We know most of the time the result will be between 0% and 18%, so 14% seems like a totally reasonable expected value.

As one of my managers at Microsoft used to say: would you bet your car on it?

Let’s just run the code again to be sure.

`0.14`

Whew. That’s a relief.

You know what they say: doing the same thing twice and expecting different results is the definition of practicing the piano, but I’m still not satisfied. Let’s run it another hundred times:

`0.14, 0.08, 0.09, 0.14, 0.08, 0.14, 0.14, 0.07, 0.07, 0.14, ...`

Oh dear me.

The expected return of our strategy is usually 14%; why it is sometimes half that? Yes, there is randomness in our algorithm, but surely it should not cause the value computed to vary so widely!

It seems like we cannot rely on this implementation, but why not? The code looks right.

As it turns out, I showed you exactly the wrong part of the graphs above. (Because I am devious.) I showed you the parts that make up 99.9% of the likely cases. Let’s look at the graph of the profit-or-loss function over the whole range of outcomes from 0.0 to 1.0:

OUCH. It’s an approximation of a step function; in the original graph that started at 0.46, we should have noticed that the extremely steep part of the curve was trending downwards to the left at an alarming rate. If you have an outcome of 0.48 from our process, you’re going to lose half your money; 0.46, you lose all your money and a little bit more. But by the time we get to 0.44, we’re down to losing over 60 times your original investment; this is a catastrophic outcome.

How likely is that catastrophic outcome? Let’s zoom in on just that part of the probability distribution. I’ll re-scale the profit function so that they fit on the same graph, so you can see where exactly the step happens:

This is a normalized PDF, so the probability of getting a particular outcome is equal to the area of the region under the curve. The region under the curve from 0.38 to 0.45 is a little less than a triangle with base 0.07 and height 0.02, so its area is in the ballpark of 0.0007. We will generate a sample in that region roughly once every 1400 samples.

But… we’re computing the expected value by taking only a thousand samples, so it is quite likely that we don’t hit this region at all. The expected losses in that tiny area are so enormous that it changes the average every time we do sample from it. Whether you include a -60 value or not in the average is a huge influence as to whether the average is 0.14 or 0.07.

Aside: if you run the expected value estimate a few thousand times or more it just gets crazier; sometimes the expected value is actually negative if we just by luck get more than one sample in this critical region.

I contrived this example to produce occasional “black swan events“; obviously, our naïve algorithm for computing expected value does not take into account the difficulty of sampling a black swan event, and therefore produces inaccurate and inconsistent results.

Next time on FAIC: We know how to solve this problem exactly for discrete distributions; can we use some of those ideas to improve our estimate of the expected value in this scenario?

# Fixing Random, part 31

Last time in this series we saw that we could compute a continuous posterior distribution when given a continuous prior and a discrete likelihood function; I hope it is clear how that is useful, but I’d like to switch gears for a moment and look at a different (but also extremely useful) computation: the expected value.

I’ll start with a quick refresher on how to compute the expected value of a discrete distribution.

You probably already know what expected value of a discrete distribution is; we’ve seen it before in this series. But in case you don’t recall, the basic idea is: supposing we have a distribution of values of a type where we can meaningfully take an average; the “expected value” is the average value of a set of samples as the number of samples gets very large.

A simple example is: what’s the expected value of rolling a standard, fair six-sided die? You could compute it empirically by rolling 6000d6 and dividing by 6000, but that would take a while.

Aside: Again, recall that in Dungeons and Dragons, XdY is “roll a fair Y-sided die X times and take the sum”.

We could also compute this without doing any rolling; we’d expect that about 1000 of those rolls would be 1, 1000 would be 2, and so on.  So we should add up (1000 + 2000 + … + 6000) / 6000, which is just (1 + 2 + 3 + 4 + 5 + 6) / 6, which is 3.5. On average when you roll a fair six-sided die, you get 3.5.

We can then do variations on this scenario; what if the die is still fair, but the labels are not 1, 2, 3, 4, 5, 6, but instead -9, -1, 0, 1, 3, 30?  As I’m sure you can imagine, once again we can compute the expected value by just taking the average: (-9 – 1 + 0 + 1 + 3 + 30) / 6 = 4.

Aside: It’s a bit weird that the “expected value” of a distribution is a value that is not even in the support of the distribution; I’ve never once rolled a 3.5 on a d6. Beginners sometimes confuse the expected value with the mode: that is, the value that you’d expect to get more often than any other value. Remember, the expected value is just an average; it only makes sense in distributions where the sampled values can be averaged.

What if the die isn’t fair? In that case we can compute a weighted average; the computation is the value of each side, multiplied by the weight of that side, sum that, and divide by the total weight. As we saw in a previous episode:

public static double ExpectedValue(
this IDiscreteDistribution<int> d) =>
d.Support()
.Select(s =>
(double)s * d.Weight(s)).Sum() / d.TotalWeight();

And of course we could similarly define methods for discrete distributions of double and so on. Hopefully that is all clear.

The question I want to explore in the next few episodes requires us to make a small extension of the meaning of “expected value”:

• Suppose we have a distribution of outcomes `d`, in the form of an `IWeightedDistribution<double>`
• Suppose we have a function `f` from double to double which assigns a value to each outcome.
• We wish to accurately and efficiently estimate the average value of `f(d.Sample())` as the number of samples becomes large.

Aside: If we had implemented `Select` on weighted distributions, that would be the same as the expected value of `d.Select(f)`— but we didn’t!

Aside: We could be slightly more general and say that the distribution is on any `T`, and `f` is a function from `T` to double, but for simplicity’s sake we’ll stick to continuous, one-dimensional distributions in this series. At least for now.

There’s an obvious and extremely concise solution; if we want the average as the number of samples gets large, just compute the average of a large number of samples! It’s like one line of code. Since we make no use of any weights, we can take any distribution:

public static double ExpectedValue(
this IDistribution<double> d,
Func<double, double> f) =>
d.Samples().Take(1000).Select(f).Average();

In fact, we could make this even more general; we only need to get a number out of the function:

public static double ExpectedValue<T>(
this IDistribution<T> d,
Func<T, double> f) =>
d.Samples().Take(1000).Select(f).Average();

We could also make it more specific, for the common case where the function is an identity:

public static double ExpectedValue(
this IDistribution<double> d) =>
d.ExpectedValue(x => x);

Let’s look at a couple of examples. (Code for this episode can be found here.) Suppose we have a distribution from 0.0 to 1.0, say the beta distribution from last time, but we’ll skew it a bit:

var distribution = Beta.Distribution(2, 5);
Console.WriteLine(distribution.Histogram(0, 1));

```      ****
*******
*********
*********
***********
************
*************
**************
***************
****************
*****************
******************
*******************
********************
**********************
***********************
************************
***************************
*****************************
----------------------------------------
```

It looks like we’ve heavily biased these coins towards flipping tails; suppose we draw a coin from this mint; what is the average fairness of the coins we draw? We can just draw a thousand of them and take the average to get an estimate of the expected value:

Console.WriteLine(distribution.ExpectedValue());

```0.28099740981762
```

That is, we expect that a coin drawn from this mint will come up heads about 28% of the time and tails 72% of the time, which conforms to our intuition that this mint produces coins that are heavily weighted towards tails.

Or, here’s an idea; remember the distribution we determined last time: the posterior distribution of fairness of a coin drawn from a Beta(5, 5) mint, flipped once, that turned up heads. On average, what is the fairness of such a coin? (Remember, this is the average given that we’ve discarded all the coins that came up tails on their first flip.)

var prior = Beta.Distribution(5, 5);
IWeightedDistribution<Result> likelihood(double d) =>
var posterior = prior.Posterior(likelihood)(Heads);
Console.WriteLine(posterior.ExpectedValue());

`0.55313807698807`

As we’d expect, if we draw a coin from this mint, flip it once, and it comes up heads, on average if we did this scenario a lot of times, the coins would be biased to about 55% heads to 45% tails.

So, once again we’ve implemented a powerful tool in a single line of code! That’s awesome.

Right?

I hope?

Unfortunately, this naive implementation has a number of problems.

Exercise: What are the potential problems with this implementation? List them in the comments!

Next time on FAIC: We’ll start to address some of the problems with this naive implementation.

# Fixing Random, part 30

Last time on FAIC I posed and solved a problem in Bayesian reasoning involving only discrete distributions, and then proposed a variation on the problem whereby we change the prior distribution to a continuous distribution, while preserving that the likelihood function produced a discrete distribution.

The question is: what is the continuous posterior when we are given an observation of the discrete result?

More specifically, the problem I gave was: suppose we have a prior in the form of a process which produces random values between 0 and 1. We sample from that process and produce a coin that is heads with the given probability. We flip the coin; it comes up heads. What is the posterior distribution of coin probabilities?

I find this question a bit hard to wrap my head around. Here’s one way to think about it: Suppose we stamp the probability of the coin coming up heads onto the coin. We mint and then flip a million of those coins once each. We discard all the coins that came up tails. The question is: what is the distribution of probabilities stamped on the coins that came up heads?

Let’s remind ourselves of Bayes’ Theorem. For prior P(A) and likelihood function P(B|A), that the posterior is:

P(A|B) = P(A) P(B|A) / P(B)

Remembering of course that P(A|B) is logically a function that takes a B and returns a distribution of A and similarly for P(B|A).

But so far we’ve only seen examples of Bayes’ Theorem for discrete distributions. Fortunately, it turns out that we can do almost the exact same arithmetic in our weighted non-normalized distributions and get the correct result.

Aside: A formal presentation of the continuous version of Bayes Theorem and proving that it is correct would require some calculus and distract from where I want to go in this episode, so I’m just going to wave my hands here. Rest assured that we could put this on a solid theoretical foundation if we chose to.

Let’s think about this in terms of our type system. If we have a prior:

`IWeightedDistribution<double> prior = whatever;`

and a likelihood:

`Func<double, IWeightedDistribution<Result>> likelihood = whatever;`

then what we want is a function:

`Func<Result, IWeightedDistribution<double>> posterior = ???`

Let’s suppose our result is `Heads`. The question is: what is `posterior(Heads).Weight(d)` equal to, for any `d` we care to choose? We just apply Bayes’ Theorem on the weights. That is, this expression should be equal to:

`prior.Weight(d) * likelihood(d).Weight(Heads) / ???.Weight(Heads)`

We have a problem; we do not have an `IWeightedDistribution<Result>` to get `Weight(Heads)` from to divide through. That is: we need to know what the probability is of getting heads if we sample a coin from the mint, flip it, and do not discard heads.

We could estimate it by repeated computation. We could call

`likelihood(prior.Sample()).Sample()`

a billion times; the fraction of them that are `Heads` is the weight of `Heads` overall.

That sounds expensive though. Let’s give this some more thought.

Whatever the `Weight(Heads)` is, it is a positive constant, right? And we have already abandoned the requirement that weights have to be normalized so that the area under the curve is exactly 1.0. Positive constants do not affect proportionality. We do not actually need to compute the denominator at all in order to solve a continuous Bayesian inference problem; we just assume that the denominator is a positive constant, and so we can ignore it.

So `posterior(Heads)` must produce a distribution such that `posterior(Heads).Weight(d)` is proportional to:

`prior.Weight(d) * likelihood(d).Weight(Heads)`

But that is just a non-normalized weight function, and we already have the gear to produce a distribution when we are given a non-normalized weight function; we can use our `Metropolis` class from two episodes ago. It can take a non-normalized weight function, an initial distribution and a proposal distribution, and produce a weighted distribution from it.

Aside: Notice that we don’t even need `prior` to be a distribution that we can sample from; all we need is its weight function.

That was all very abstract, so let’s look at the example I proposed last time: a mint with poor quality control produces coins with a particular probability of coming up heads; that’s our prior.

Therefore we’ll need a PDF that has zero weight for all values less than 0 and greater than 1. We don’t care if it is normalized or not.

Remember, this distribution represents the quality of coins that come from the mint, and the value produced by sampling this distribution is the bias of the coin towards heads, where 0 is “double tailed” and 1 is “double headed”.

Let’s choose a beta distribution.

Aside: I’ve implemented the very useful gamma and beta distributions, but I won’t bother to show them in the blog. The code is tedious and explaining the mathematics would get us pretty off-topic. The short version is that the beta distribution is a distribution on numbers between 0 and 1 that can take on a variety of shapes. See the source code repository for the details of the implementation, and Wikipedia for an explanation of the distribution; various possible shapes for its two parameters are given below:

The code for the Beta and Gamma distributions and the rest of the code in this episode can be found here.

Let’s take `Beta.Distribution(5, 5)` as our prior distribution. That is, the “fairness” of the coins produced by our terrible mint is distributed like this:

var prior = Beta.Distribution(5, 5);
Console.WriteLine(prior.Histogram(0, 1));

```                  ****
******
********
**********
**********
************
************
**************
***************
****************
****************
******************
********************
********************
**********************
***********************
************************
**************************
******************************
----------------------------------------```

Most of the coins are pretty fair, but there is still a significant fraction of coins that produce 90% heads or 90% tails.

We need a likelihood function, but that is easy enough. Let’s make an enumerated type for coins called `Result` with two values, `Heads` and `Tails`:

IWeightedDistribution<Result> likelihood(double d) =>

We can use `Metropolis` to compute the posterior, but what should we use for the initial and proposal distributions? The prior seems like a perfectly reasonable guess as to the initial distribution, and we can use a normal distribution centered on the previous sample for the proposal.

IWeightedDistribution<double> posterior(Result r) =>
Metropolis<double>.Distribution(
d => prior.Weight(d) * likelihood(d).Weight(r), // weight
prior, // initial
d => Normal.Distribution(d, 1)); // proposal

But you know what, let’s once again make a generic helper function:

public static Func<B, IWeightedDistribution<double>> Posterior<B>(
this IWeightedDistribution<double> prior,
Func<double, IWeightedDistribution<B>> likelihood) =>
b =>
Metropolis<double>.Distribution(
d => prior.Weight(d) * likelihood(d).Weight(b),
prior,
d => Normal.Distribution(d, 1));

And now we’re all set:

var posterior = prior.Posterior(likelihood);

```                     ****
******
*******
*********
***********
************
************
*************
**************
***************
****************
******************
******************
*******************
********************
**********************
************************
*************************
*****************************
----------------------------------------```

It is subtle, but if you compare the prior to the posterior you can see that the posterior is slightly shifted to the right, which is the “more heads” end of the distribution.

Aside: In this specific case we could have computed the posterior exactly. It turns out that the posterior of a beta distribution in this scenario is another beta distribution, just with different shape parameters. But the point is that we don’t need to know any special facts about the prior in order to create a distribution that estimates the posterior.

Does this result make sense?

The prior is that a coin randomly pulled from this mint is biased, but equally likely to be biased towards heads or tails. But if we get a head, that is evidence, however slight, that the coin is not biased towards tails. We don’t know if this coin is close to fair, or very biased towards heads, or very biased towards tails, but we should expect based on this single observation that the first and second possibilities are more probable than the prior, and the third possibility is less probable than the prior. We see that reflected in the histogram.

Or, another way to think about this is: suppose we sampled from the prior many times, we flipped each coin exactly once, and we threw away the ones that came up tails, and then we did a histogram of the fairness of those that came up heads. Unsurprisingly, the coins more likely to turn up tails got thrown away more often than the coins that came up heads, so the posterior histogram is biased towards heads.

Exercise: Suppose we changed the scenario up so that we were computing the posterior probability of two flips. If we get two heads, obviously the posterior will be even more on the “heads” side. How do you think the posterior looks if we observe one head, one tail? Is it different than the prior, and if so, how? Try implementing it and see if your conjecture was correct.

What have we learned? That given a continuous prior, a discrete likelihood, and an observation, we can simulate sampling from the continuous posterior.

Why is this useful? The coin-flipping example is not particularly interesting, but of course coin-flipping is a stand-in for more interesting examples. Perhaps we have a prior on how strongly a typical member of a population believes in the dangers of smoking; if we observe one of them smoking, what can we deduce is the posterior strength of their beliefs? Or perhaps we have a prior on whether a web site user will take an action on the site, like buying a product or changing their password, or whatever; if we observe them taking that action, what posterior deductions can we make about their beliefs?

As we’ve seen a number of times in this series, it’s hard for humans to make correct deductions about posterior probabilities even when given good priors; if we can make tools available to software developers that help us make accurate deductions, that has a potentially powerful effect on everything from the design of developer tools to the design of public policy.

Next time on FAIC: Thus far in this series we’ve been very interested in the question “given a description of a distribution, how can I sample from it efficiently?” But there are other things we want to do with distributions; for example, suppose we have a random process that produces outcomes, and each outcome is associated with a gain or loss of some amount of money. What is the expected amount we will gain or lose in the long run? We’ll look at some of these “expected value” problems, and see how the gear we’ve developed so far can help compute the answers.

# Fixing Random, part 29

[It is] a spectacular vindication of the principle that each individual coin spun individually is as likely to come down heads as tails and therefore should cause no surprise that each individual time it does.

Thus Guildenstern (or is it Rosencrantz?) in Tom Stoppard’s re-imagining of Hamlet “Rosencrantz and Guildenstern Are Dead“. If you haven’t seen it already, go watch the movie, or at least the first scene.

It helps to know the plot of Hamlet: Before the play begins, prince Hamlet’s uncle Claudius has murdered Hamlet’s father, become king, and married Hamlet’s mother Gertrude. Hamlet is understandably perturbed by this sequence of events. Claudius and Gertrude have invited Hamlet’s college friends Rosencrantz and Guildenstern to cheer him up, but SPOILER ALERT in a series of misadventures they end up dead, soon to be followed by everyone else; the plot, such as it is, of R&G Are Dead is this story told from their confused, uninformed, and amnesiac perspective.

.

.

.

Welcome back. As you now know, if you did not before, Rosencrantz and Guildenstern are flipping (or, as they say, spinning) a series of coins. In the film sometimes they are flipping a different coin each time, which Rosencrantz (or is it Guildenstern?) wins, and sometimes they’re flipping the same coin several times in a row. Either way, the coin flipper (whatever his name is) hits an enormously unlikely sequence of heads.

Let’s make up a variation on this scenario to see what we can learn about posterior  distributions. (Code for this episode can be found here.)

Suppose we have two kinds of coins: fair, and double-headed. A fair coin has a 50-50 probability of coming up heads or tails; a double-headed coin always comes up heads.

enum Coin { Fair, DoubleHeaded }

Now let’s suppose we have a bag of coins. 999 are fair, and one is double-headed. Rosencrantz is going to pull a coin from the bag at random. Our prior is that the selection is random over the thousand coins in the bag, so:

var coins = new[] { Fair, DoubleHeaded };
var prior = coins.ToWeighted(999, 1);

Once he has his coin, Rosencrantz flips it, and of course, observes that it is heads.

The question I want to explore is: what is the posterior probability that Rosencrantz just flipped the double-headed coin?

That is, Rosencrantz draws a coin, flips it, it comes up heads, and now the question to you is: what would be fair odds for a bet on the question “is it the double-headed coin or a regular coin that was just flipped?” Prior to the observation the fair odds are 1 to 999, but after we observe the flip, those odds change because we have more information.

Reasoning backwards like this is a little tricky. (Of course we call it the posterior distribution because we are reasoning backwards, in case that was unclear.) Again, let’s break it down:

• Prior to flipping the coin, the probability of getting the double-headed coin is the aptly-named prior probability: 1 in 1000, which is odds of 1 to 999.
• If Rosencrantz flipped the coin and got tails, that would be solid evidence that he did not get the double-headed coin; the posterior probability of having the double headed coin is zero. That is, the posterior probability is smaller: we’ve gone from 0.1% to 0%, a small decrease.
• If Rosencrantz got heads, that is very weak evidence that he got a double-headed coin, but it is not zero evidence. There should be some small increase over the prior in the posterior.

Of course we don’t need to do the math by hand; we know from earlier in this series that we can exactly solve for posterior probabilities by using a combination of `Where` and `SelectMany` query operators:

var results = new[] { Heads, Tails };
IDiscreteDistribution<Result> likelihood(Coin c) =>
results.ToWeighted(1, c == Fair ? 1 : 0);

var c1 = from c in prior
from r in likelihood(c)
where r == Heads
select c;
Console.WriteLine(c1.ShowWeights());

```        Fair:999

If Rosencrantz flipped heads then the probability that he drew the double-headed coin is slightly increased over the prior. The prior probability of having the double-headed coin is 1 in 1000; the posterior probability is 2 in 1001. That’s almost twice as likely, but still pretty unlikely.

Again, let’s make sure that we understand what this means. It means that if we did millions of repetitions of this experiment where Rosencrantz picks a coin and flips it once, and if we discarded all the repetitions in which Rosencrantz flipped tails, then in the remaining repetitions on average the ratio of double-headed coins to fair coins chosen would be about 2 to 999.

What if we increase the number of flips of that same coin, and he got two heads?

var c2 = from c in prior
from r1 in Likelihood(c)
where r1 == Heads
from r2 in Likelihood(c)
where r2 == Heads
select c;
Console.WriteLine(c2.ShowWeights());

```        Fair:999

The posterior probability of having the double-headed coin goes up again, this time to 4 in 1003.

What if we increased it to ten flips (again, of the same coin), and they’re all heads? By the time we get to ten heads, there are two possibilities: the exactly one-in-a-thousand chance that Rosencrantz got the double-headed coin, or the approximately one-in-a-thousand chance that he got a fair coin and flipped ten heads in a row; the ratio of those two probabilities is the posterior odds. At this point it becomes pretty much even money whether he has the double-headed coin or a fair coin. (As you might expect, the exact odds are 999 “fair” to 1024 “double-headed”, which is pretty darn close to 50-50.)

If we observe eleven or more heads, it becomes much more likely that Rosencrantz has drawn the one-in-a-thousand double-headed coin than that he is lucky enough to observe that sequence of heads from a fair coin, and of course by the time we get to 156 heads, it is absurdly more likely to be the double-headed coin than a fair coin.

This is a cute problem and all, but it seems a little out of place at this point in our series; we stopped talking about discrete distributions that we could solve exactly for posteriors a while back. I’m discussing it now because all this is in the way of introducing a variation on the above problem:

• We have a mint with really poor quality control; it produces coins that are out of balance, so that some of them favour heads and some of them favour tails.
• To model this, let’s say that the mint samples from some continuous distribution that produces values strictly between 0.0 and 1.0. Let’s say that we know this distribution.
• Each coin that is minted is modeled as a Bernoulli distribution of heads and tails, with the given probability of heads.
• Rosencrantz flips a random coin that was obtained from that mint; this is our prior.
• We observe that Rosencrantz gets heads.
• What is the posterior distribution on the fairness of that coin?

When I first started thinking about this problem I found it quite tricky to get it clear in my head, but this continuous version is not fundamentally different than the discrete version I gave above. In both cases we have a prior distribution of coin fairness that we sample from to get a coin. In both cases we then obtain an observation of “heads”. And in both cases, we can use that observation to deduce something about the likely provenance of the coin. Whether we’re in the continuous case or the discrete case, flipping heads is evidence that the coin is more likely to be on the “heads heavy” side of the prior distribution. But how much more likely?

In the all-discrete case we could compute that posterior exactly. The question before us now is: given the continuous prior, the discrete likelihood function and an observation of a discrete value, can we compute the weight function of the continuous posterior? That is, can we answer the question “what is the probable fairness level of this coin given that we’ve observed one head?” And of course, given all that information, can we also produce a distribution object that we can sample from?

Give it some thought.

Next time on FAIC: We will attempt to do so using the tools we’ve developed thus far.

# Fixing Random, part 28

Last time on FAIC we implemented a technique for sampling from a non-normalized target PDF:

• Find an everywhere-larger helper PDF that we can sample from.
• Sample from it.
• Accept or reject the sample via a coin flip with the ratio of weights in the target distribution and the helper distribution.

This technique totally works, but it has a few drawbacks:

• It’s not at all clear how to find a good helper PDF without humans intervening.
• Even with a pretty tight-fitting helper, you potentially still end up rejecting a lot of samples.

The first problem is the big one. It would be great if there was a technique that didn’t require quite so much intervention from experts.

Today I’ll describe just such a technique; it is called the “Metropolis algorithm” after one of its inventors, Nicholas Metropolis. [Code for this episode is here.]

Aside: The co-authors of the 1953 paper that first presented this algorithm are unfairly not credited in the name of the algorithm, but “the Metropolis algorithm” sounds snappy and cool, whereas “the Metropolis-Rosenbluth-Rosenbluth-Teller-and-Teller algorithm” is a bit of a mouthful.

In modern code we usually use a variation called the “Metropolis-Hastings algorithm“, but I’m not going to discuss it in this series.

Without further ado, here’s a sketch of the Metropolis algorithm:

• We have a non-normalized PDF f(t) that takes a `T` and returns a `double`. We want to produce arbitrarily many samples of `T` that conform to this distribution.
• The first time that `Sample()` is called we produce the initial sample.
• Now, obviously, if we could produce a random initial sample that matched the distribution described by f, we’d already have solved the problem! All that we require is a random initial sample that is somewhere in the support of f.
• We’ll call the distribution of the first sample the initial distribution.
• The initial sample becomes the current sample, which is returned.
• Every subsequent time that `Sample()` is called, we use the current sample to generate a candidate for the new sample; more specifically, we produce a proposal distribution of possible next samples.
• We sample the proposal distribution to get a proposed sample.
• We compute the weights f(current) and f(proposed)
• We divide the proposed weight by the current weight.
• If the quotient is greater than or equal to 1.0 then the proposed sample automatically becomes the new current sample.
• Otherwise, we flip a Bernoulli coin with that probability. Heads, we accept the proposed sample and it becomes the current sample. Tails, we keep the current sample.

Aside: Do you see why we don’t care if the PDFs are non-normalized in this algorithm? All we care about is the ratio of the weights when determining the Bernoulli distribution. Non-normalized PDFs have the same ratios as normalized PDFs!

Aside: There are a few technical restrictions on the proposal distribution that I’m not going to discuss; for details, see the Wikipedia page on ergodicity.

That might have been a bit abstract. Let’s say in type system terms what we need. First off, we have the target distribution we’re attempting to sample from:

`Func<T, double> target;`

Next, we have an initial distribution to get the first sample:

`IDistribution<T> initial;`

And finally, we have a function that produces proposal distributions:

`Func<T, IDistribution<T>> proposals;`

But… wait a minute. This looks really familiar.

• We’re producing a sequence of samples.
• We randomly choose an initial value.
• We randomly choose each subsequent value based on a probability distribution determined only by the previous value.

This is a Markov process!

Apparently we can use a Markov process to simulate sampling from an arbitrary PDF, assuming that the restrictions on the initial and proposal distributions are met.

Well this should be easy to implement, given that we’ve already implemented Markov processes, so let’s do it. This time, the fun will all be in the factory:

sealed class Metropolis<T> : IWeightedDistribution<T>
{
private readonly IEnumerator<T> enumerator;
private readonly Func<T, double> target;
public static Metropolis<T> Distribution(
Func<T, double> target,
IDistribution<T> initial,
Func<T, IDistribution<T>> proposal)
{
var markov = Markov<T>.Distribution(initial, transition);
return new Metropolis<T>(
target, markov.Sample().GetEnumerator());
IDistribution<T> transition(T d)
{
T candidate = proposal(d).Sample();
return Flip<T>.Distribution(
candidate, d, target(candidate) / target(d));
}
}
private Metropolis(
Func<T, double> target,
IEnumerator<T> enumerator)
{
this.enumerator = enumerator;
this.target = target;
}
public T Sample()
{
this.enumerator.MoveNext();
return this.enumerator.Current;
}
public double Weight(T t) => this.target(t);
}

Let’s try it out. We’ll need three things. For our target PDF, let’s take the mixed PDF we looked at last time:

double Mixture(double x) =>
Exp(x * x) + Exp((1.0  x) * (x  1.0) * 10.0);

What should our initial distribution be? All doubles are in the support of this distribution, so we can choose any, but we’d like it if the double was likely to be high-weight. Let’s suppose that we don’t know the exact shape of this distribution, but we do know that it has a lump near the origin, so we’ll just use a standard normal distribution as our initial value.

What should our proposal distribution be? It needs to be based on the current value, not biased to consistently get larger or smaller, and give us the possibility of any value in the support of our target distribution being produced. How about a normal distribution centered on the current value? That seems reasonable.

You know what, I’m going to make a helper method just for this case, since it seems likely to be pretty common:

public static Metropolis<double> NormalMetropolis(
this Func<double, double> weight) =>
Metropolis<double>.Distribution(
weight,
Normal.Standard,
d => Normal.Distribution(d, 1.0));

Finally let’s run it and see if we get a reasonable histogram out:

NormalMetropolis(Mixture).Histogram(2, 2)

```                            ***
***
***
***
*****
****     *****
*******  ******
******** *******
*****************
*******************
********************
*********************
*********************
***********************
************************
*************************
****************************
******************************
**********************************
----------------------------------------```

Nice!

This seems to work really well, but unfortunately there are a few problems to deal with. Some of them are:

• We need an initial distribution that produces a value in the support. However, this is usually not too hard. And suppose by some accident we do end up getting a zero-weight element as the first sample; what’s going to happen? Probably the proposal distribution will pick a non-zero-weight element, and it will then automatically win, so hey, maybe the first sample is bad. No big deal.
• Let’s think a bit more about that issue though. Even if the first sample has positive weight, we might have a sequence of unusually low-weight samples at the beginning. If we start in a low-weight region there might be some time spent randomly walking around before we stumble into a high-weight region. Of course once we are in a high-weight region we are likely to stay there, which is good…
• … unless the distribution is “multimodal”. That is, there are multiple high-weight regions. Our example distribution in this episode is multimodal, but the two “lumps” are really close together and are “bridged” by a reasonable high-probability region; imagine a distribution where the “lumps” were far apart with a super-low-probability region between them. You can end up “stuck” in one of the lumps for a long time, and this can lead to situations where long sequences of samples have way more correlation between them than we would expect if we were sampling from the “real” distribution.
• The previous point is particularly emphasized by the fact that in a Metropolis simulation of sampling from a distribution, we frequently get repeated values in samples; that sort of correlation is typically not observed at all were we drawing from the “real” distribution rather than a Markovian simulation of it.

All of these problems can be dealt with by a variety of techniques; you’ve probably already thought of some of them:

• We could “burn” the first few dozen / hundred / whatever of the samples so that we get past any initial low-weight region.
• If samples are too correlated with each other, we could use a proposal distribution with a higher variance. (For example, we could use a normal distribution with a larger standard deviation.) That would increase the number of rejected samples, which increases repeats, but it would decrease the number of samples that were all close to each other but not rejections.
• Or, we could sample from a Bernoulli at a probability of our choosing on each sample, and if it’s heads, we skip enumerating the value but still keep it as our new state internally; that would decrease the amount of similarity between subsequent samples, while increasing the average computational cost of each sample.
• Or, we could generate samples in batches of a hundred or a thousand or whatever, randomly shuffle the batch, and then enumerate the shuffled batch, to decrease the amount of similarity between subsequent samples.

I’m not going to do any of these things, but I’m sure you can see how they would be done.

We’ve made really good progress here; we have an efficient general mechanism for sampling from a well-behaved PDF, so long as we have some idea of the initial distribution and a good guess at a proposal distribution. We also have some ideas for “tuning parameters” — that is, knobs that experts can turn to make tradeoffs between performance, correlation, and so on, and the knobs are simple enough that you could imagine writing a program to try different settings and figure out what works best.

Next time on FAIC: We’re going to seemingly take a step backwards and talk about a problem in discrete Bayesian reasoning: given a bag of coins, some unfair, and some observations of the results of flipping a coin pulled from the bag, what can we deduce about the coin we pulled?

# Fixing Random, part 27

Last time on FAIC we went through a loose, hand-wavy definition of what it means to have a “weighted” continuous distribution: our weights are now doubles, and given by a Probability Distribution Function; the probability of a sample coming from any particular range is the area under the curve of that range, divided by the total area under the function. (Which need not be 1.0.)

A central question of the rest of this series will be this problem: suppose we have a delegate that implements a non-normalized PDF; can we implement `Sample()` such that the samples conform to the distribution?

The short answer is: in general, no.

A delegate from double to double that is defined over the entire range of doubles has well over a billion billion possible inputs and outputs. Consider for example the function that has a high-probability lump in the neighbourhood of -12345.678 and another one at 271828.18 and is zero everywhere else; if you really know nothing about the function, how would you know to look there?

We need to know something about the PDF in order to implement `Sample()`.

The long answer is: if we can make a few assumptions then sometimes we can do a pretty good job.

Aside: As I’ve mentioned before in this series: if we know the quantile function associated with the PDF then we can very easily sample from the distribution. We just sample from a standard continuous uniform distribution, call the quantile function with the sampled value, and the value returned is our sample. Let’s assume that we do not know the quantile function of our PDF.

Let’s look at an example. Suppose I have this weight function:

double Mixture(double x) =>
Exp(x * x) + Exp((1.0  x) * (x  1.0) * 10.0);

If we graph that out, it looks like this:

I called it “mixture” because it is the sum of two (non-normalized) normal distributions. This is a valid non-normalized PDF: it’s a pure function from all doubles to a non-negative double and it has finite area under the curve.

How can we implement a `Sample()` method such that the histogram looks like this?

Exercise: Recall that I used a special technique to implement sampling from a normal distribution. You can use a variation on that technique to efficiently sample from a mixture of normal distributions; can you see how to do so? See if you can implement it.

However, the point of this exercise is: what if we did not know that there was a trick to sampling from this distribution? Can we sample from it anyways?

The technique I’m going to describe today is, once more, rejection sampling. The idea is straightforward; to make this technique work we need to find a weighted “helper” distribution that has this property:

The weight function of the helper distribution is always greater than or equal to the weight function we are trying to sample from.

Now, remember, the weight function need not be “scaled” so that the area under the curve is 1.0. This means that we can multiply any weight function by a positive constant, and the distribution associated with the multiplied weight function is the same. That means that we can weaken our requirement:

There exists a constant factor such that the weight function of the helper distribution multiplied by the factor is always greater than or equal to the weight function we are trying to sample from.

This will probably be more clear with an example.

Let’s take the standard normal distribution as our helper. We already know how to sample from it, and we know it’s weight function. But it just so happens that there exists a constant — seven — such that multiplying the constant factor by the helper’s weight function dominates our desired distribution:

Again, we’re going to throw some darts and hope they land below the red curve.

• The black curve is the weight function of the helper — the standard normal distribution — multiplied by seven.
• We know how to sample from that distribution.
• Doing so gives us an x coordinate for our dart, distributed according to the height of the black curve; the chosen coordinate is more likely to be in a higher region of any particular width than a lower region of the same width.
• We’ll then pick a random y coordinate between the x axis and the black curve.
• Now we have a point that is definitely below the black line, and might be below the red line.
• If it is not below the red line, reject the sample and try again.
• If it is below the red line, the x coordinate is the sample.

Let’s implement it!

Before we do, once again I’m going to implement a Bernoulli “flip” operation, this time as the class:

sealed class Flip<T> : IWeightedDistribution<T>
{
public static IWeightedDistribution<T> Distribution(
T heads, T tails, double p)

You know how this goes; I will skip writing out all that boilerplate code. We take values for “heads” and “tails”, and the probability (from 0 to 1) of getting heads. See the github repository for the source code if you care.

I’m also going to implement this obvious helper:

static IWeightedDistribution<bool> BooleanBernoulli(double p) =>
Flip<bool>.Distribution(true, false, p);

All right. How are we going to implement rejection sampling? I always begin by reasoning about what we want, and what we have. By assumption we have a target weight function, a helper distribution whose weight function “dominates” the given function when multiplied, and the multiplication factor. The code practically writes itself:

public class Rejection<T> : IWeightedDistribution<T>
{
public static IWeightedDistribution<T> Distribution(
Func<T, double> weight,
IWeightedDistribution<T> helper,
double factor = 1.0) =>
new Rejection<T>(weight, dominating, factor);

I’ll skip the rest of the boilerplate. The weight function is just:

public double Weight(T t) => weight(t);

The interesting step is, as usual, in the sampling.

Rather than choosing a random number for the y coordinate directly, instead we’ll just decide whether or not to accept or reject the sample based on a Bernoulli flip where the likelihood of success is the fraction of the weight consumed by the target weight function; if it is not clear to you why that works, give it some thought.

public T Sample()
{
while(true)
{
T t = this.helper.Sample();
double hw = this.helper.Weight(t) * this.factor;
double w = this.weight(t);
if (BooleanBernoulli(/ hw).Sample())
return t;
}
}

All right, let’s take it for a spin:

var r = Rejection<double>.Distribution(
Mixture, Normal.Standard, 7.0);
Console.WriteLine(r.Histogram(2.0, 2.0));

And sure enough, the histogram looks exactly as we would wish:

```                             **
***
***
*****
*****
******   ******
****************
*****************
*******************
********************
*********************
*********************
***********************
************************
*************************
***************************
*****************************
*********************************
----------------------------------------```

How efficient was rejection sampling in this case? Actually, pretty good. As you can see from the graph, the total area under the black curve is about three times the total area under the red curve, so on average we end up rejecting two samples for every one we accept. Not great, but certainly not terrible.

Could we improve that? Sure. You’ll notice that the standard normal distribution times seven is not a great fit. We could shift the mean 0.5 to the right, and if we do that then we can reduce the multiplier to 4:

That is a far better fit, and if we sampled from this distribution instead, we’d reject a relatively small fraction of all the samples.

Exercise: Try implementing it that way and see if you get the same histogram.

Once again we’ve managed to implement `Sample()` by rejection sampling; once again, what are the pros and cons of this technique?

• Pro: it’s conceptually very straightforward. We’re just throwing darts and rejecting the darts that do not fall in the desired area. The darts that do fall in the desired area have the desired property: that samples from a given area arrive in proportion to that area’s size.
• Con: It is by no means obvious how to find a tight-fitting helper distribution that we can sample such that the helper weight function is always bigger than the target weight function. What distribution should we use? How do we find the constant multiplication factor?

The rejection sampling method works best when we have the target weight function ahead of time so that we can graph it out and an expert can make a good decision about what the helper distribution should be. It works poorly when the weight function arrives at runtime, which, unfortunately, is often the case.

Next time on FAIC: We’ll look at a completely different technique for sampling from an arbitrary PDF that requires less “expert choice” of a helper distribution.

# Fixing Random, part 26

We’ve been mostly looking at small, discrete distributions in this series, but we started this series by looking at continuous distributions. Now that we have some understanding of how to solve probability problems on simple discrete distributions and Markov processes, let’s go back to continuous distributions and see if we can apply some of these learnings to them. (Code for this episode can be found here.)

Let’s start with the basics. What exactly do we mean by a probability distribution function? So far in this series we’ve mostly looked at the discrete analog, a non-normalized probability mass function. That is, for an `IDiscreteDistribution<T>` we have a weight function that gives us a value for each `T`. The probability of sampling a particular `T` is the weight divided by the total weight of all `T`s.

Of course, had we decided to go with double weights instead of integers, we could have made a normalized probability mass function: that is, the “weight” of each particular `T` is automatically divided by the total weight, and we get the probability out. In this scenario, the total weight adds up to 1.0.

We know from our exploration of the weighted integer distribution that we can think of our probability distributions as making a rectangle where various sub-rectangles are associated with particular values; we then “throw a dart” at the rectangle to sample from the distribution; where it lands gives us the sample.

Continuous distributions can be thought of in much the same way. Suppose we have a function from double to double, always non-negative, such that the total area under the curve is 1.0. Here are some examples:

double PDF1(double x) => x < 0.0 | x >= 1.0 ? 0.0 : 1.0;
double PDF2(double x) => x < 0.0 | x >= 1.0 ? 0.0 : 2 * x;
double PDF3(double x) => Exp((x * x)) / Sqrt(PI);

What is the meaning of these as probability distributions? Plainly the “higher” the function, the “more likely” any particular value is, but what does it even mean to say that in our PDF2 and PDF3 distributions, that 0.5 is “less likely” than 0.6 but they are “equally likely” in our PDF1 distribution?

One way to think of it is that again, we “throw a dart” at the area under the curve. Given any subset of that area, the probability of the dart landing inside it is proportional to the area of the subset, and the value sampled is the x coordinate of the dart.

We can make this a little bit more formal by restricting our areas to little rectangles:

• Take a value, say 0.5.
• Now take a tiny offset, call it `ε`. Doesn’t matter what it is, so long as it is “pretty small”.
• The probability of getting a sample value between `0.5` and `0.5+ε` is `PDF(0.5)*``ε`. That is, the area of a thin rectangle.

That’s it! Let’s say we’ve got values 0.5 and 0.6, and an `ε` of 0.01. How do our three distributions compare with each other near these values?

• With `PDF1`, the probability of getting a sample between 0.5 and 0.51 is approximately 0.010, and the probability of getting a sample between 0.6 and 0.61 is also approximately 0.010.
• With `PDF2`, the probability of getting a sample between 0.5 and 0.51 is approximately 0.010, and the probability of getting a sample between 0.6 and 0.61 is approximately 0.012.
• With `PDF3`, the probability of getting a sample between 0.5 and 0.51 is approximately 0.006, and the probability of getting a sample between 0.6 and 0.61 is approximately 0.004

And moreover, the approximation becomes better as `ε` gets smaller.

Aside: What if we want to know the probability of getting a value between two doubles that are not “really close together”? In that case we can do “quadrature”: divide it up into a bunch of regions that are all “really small” and sum up the areas of all those little rectangles to get an approximation of the area. Or, we can use integral calculus to find the exact answer. But for the purposes of this series, we’re going to try to find ways to sample from a distribution that has a particular PDF without doing any quadrature approximations or any integral calculus.

All right, we understand what PDFs are. To summarize:

• A PDF is a pure function from double to double.
• A PDF is defined over the entire range of doubles.
• `PDF(x) >= 0.0` for all `x`.
• Integrating the PDF over the entire real line gives us a total area of 1.0.
• The probability that a sample is between `x` and `x+ε` is approximately `PDF(x)*``ε`, if `ε` is small.

For our purposes though I’m going to relax those conditions a bit. In the code I want to explore, we’ll assume that a “non-normalized” PDF has these properties:

• A PDF is a pure function from `T` to `double`​…
• … defined over the entire range of `T`.
• `PDF(t) >= 0.0` for all `t`.
• Integrating the PDF over all possible values of `T` gives us a total area of `A`, for some finite positive value `A`. That is, the distribution is not necessarily “normalized” to have an area of 1.0.  (You’d think that we can “easily” get a normalized distribution from a non-normalized distribution by dividing every weight by `A` but it is not always easy to know what that area is. But perhaps sometimes we do not need to know! Foreshadowing!)
• The probability that a sample is within `ε` of `t` is approximately `PDF(t)*ε/A` if `ε` is small. (Of course, as I noted above, we might not know the normalization constant.)

Aside: That last point is a little vague; what does it mean to be “within `ε` of some `t`” if `t` is not `double`? We’re going to just not worry about that. It’ll be fine. Particularly since we’re mostly going to look at situations where `T` is `double` anyways.

This becomes useful for cases like the multidimensional case where `T` is `(double, double)`, say, and by “within `ε`” we mean “inside a square of side `√ε`” or some such thing. Putting this all on a solid mathematical footing would take us too far off topic; you get the idea and let’s move on. I don’t know if we’ll get to multidimensional distributions in this series or not.

As we’ll see, continuous distributions are a much harder beast to tame than the small discrete distributions we’ve looked at before. However, there are still things we can do with them. Let’s start by making a definition in the type system for a distribution that has one of our relaxed PDFs. It’s awfully familiar:

public interface IWeightedDistribution<T> : IDistribution<T>
{
double Weight(T t);
}

This means that every discrete distribution gets to be a weighted distribution pretty much “for free”. Let’s implement that:

public interface IDiscreteDistribution<T> : IWeightedDistribution<T>{
IEnumerable<T> Support();
new int Weight(T t);
}

And now we have some minor taxes to pay; we’ll need to add

double IWeightedDistribution<R>.Weight(T t) => this.Weight(t);

to all our existing discrete distributions, but that is easily done.

We can also go back and fill in weight functions in our `Normal` class:

private static readonly double piroot = 1.0 / Sqrt(2 * PI);
public double Weight(double x) =>
Exp((x  μ) * (x  μ) / (2 * σ * σ)) * piroot / σ;

And our `StandardContinuousUniform` class:

public double Weight(double x) =>
0.0 <= x & x < 1.0 ? 1.0 : 0.0;

Easy peasy.

We’ve got a new variation on our old concept of weighting a distribution; let’s see where we can go with this.

Next time on FAIC: We know how to efficiently sample from any small, discrete distribution with integer weights, but distributions based on arbitrary PDFs are much harder beasts to tame. Given an arbitrary non-normalized PDF, can we sample from it?

# Fixing Random, part 25

Last time on FAIC we implemented the Markov process distribution, which is a distribution over state sequences, where the initial state and each subsequent state is random. There are lots of applications of Markov processes; a silly one that I’ve been wanted to do for literally decades is randomized text generation. (The silly code for this silly episode is here; bring your own corpus!)

In my youth I collected back issues of Scientific American, which I bought at the sadly now defunct Casablanca Books in my home town. I believe I sold all of them back when I moved out to Seattle; they were pretty bulky, though I might still have my autographed-by-Benoit-Mandelbrot copy of the issue about fractals somewhere in the house.

My favourite columns in old SciAms were of course the mathematical / programming games columns by Martin Gardner, Kee Dewdney and Douglas Hofstadter. The one germane to today’s topic was a Dewdney column about “Mark V. Shaney” — what we would now call a “bot”, that was running a Markov process generated by analysis of USENET posts, sampling from the resulting distribution to produce a “Markov chain” sequence (Mark V. Shaney, get it?) and posting the resulting generated text right back to USENET.

I’ve always wanted to replicate that, and today I have my chance. The basic technique is straightforward:

• Start with a corpus of texts.
• Break up the text into words.
• Group the words into sentences.
• Take the first word of each sentence and make a weighted distribution; the more times this word appears as a first word in a sentence, the higher it is weighted. This gives us our initial distribution.
• Take every word in every sentence. For each word, generate a distribution of the words which follow it in the corpus of sentences. For example, if we have “frog” in the corpus, then we make a distribution based on the words which follow: “prince” twice, “pond” ten times, and so on. This gives us the transition function.
• There are a number of ways to handle words which end sentences, but an easy way to do it is simply to have a bunch of “end words” where the transition function gives the empty distribution.
• Sample repeatedly, and format the sequences of words back into sentences.
• Hilarity!

OMG let’s do it.

All our probability distributions thus far have been immutable and we should continue that tradition. I’ll use the builder pattern to construct immutable objects. We’ll start by making a categorical distribution builder, and then make a Markov builder from that:

sealed class DistributionBuilder<T>
{
private Dictionary<T, int> weights = new Dictionary<T, int>();
public void Add(T t)
{
weights[t] = weights.GetValueOrDefault(t) + 1;
}
public IDistribution<T> ToDistribution()
{
var keys = weights.Keys.ToList();
var values = keys.Select(k => weights[k]);
return keys.ToWeighted(values);
}
}

Aside: Recall that we do not need to worry that we’ve closed over a member that is a mutable collection here, because `ToWeighted` is going to call `ToList` on the values.

Our Markov process builder is only slightly more complicated:

sealed class MarkovBuilder<T>
{
private DistributionBuilder<T> initial =
new DistributionBuilder<T>();
private Dictionary<T, DistributionBuilder<T>> transitions =
new Dictionary<T, DistributionBuilder<T>>();
public void AddInitial(T t)
{
}
public void AddTransition(T t1, T t2)
{
if (!transitions.ContainsKey(t1))
transitions[t1] = new DistributionBuilder<T>();
}
public Markov<T> ToDistribution()
{
var init = initial.ToDistribution();
var trans = transitions.ToDictionary(
kv => kv.Key,
kv => kv.Value.ToDistribution());
return Markov<T>.Distribution(
init,
k => trans.GetValueOrDefault(k, Empty<T>.Distribution));
}
}

Super! Now all we need is the complete works of Shakespeare, thank you Project Gutenberg. We don’t need to load the whole five megs into memory at once; we can process it line-at-a-time thanks to our builder. We’ll start by splitting up the line stream into a word stream:

static readonly char[] punct =
“<>,*-()[#]@:%\”/’;_&}”.ToCharArray();
static IEnumerable<string> Words(
this
IEnumerable<string> lines) =>
from line in lines
from word in line.Split()
let raw = word.Trim(punct)
where raw != “”
select raw;

Exercise: This is a pretty crappy word-splitting algorithm. Basically I’m stripping out any leading or trailing punctuation but I’m leaving in the periods, exclamation marks and question marks. This means that I’m stripping out the trailing single quotes in Shakespearean contractions like th’.

I’m also not canonicalizing the casing; I’ll be treating “The” and “the” as different words, and so on. This is not necessarily the best approach.

Can you come up with a better word-splitting algorithm than this one? This will do for my silly purposes, but it would be nice to have a better algorithm.

A generalized “split a sequence into subsequences” operator is a handy tool to have that is not included in the LINQ sequence operators; let’s make one:

public static IEnumerable<List<T>> Split<T>(
this IEnumerable<T> items,
Func<T, bool> last)
{
var list = new List<T>();
foreach (T item in items)
{
if (last(item))
{
yield return list;
list = new List<T>();
}
}
if (list.Any())
yield return list;
}

And now making a sentence-extractor is straightforward:

static IEnumerable<List<string>> Sentences(
this IEnumerable<string> words) =>
words.Split(w => {
char c = w[w.Length  1];
return c == ‘.’ || c == ‘!’ || c == ‘?’;
});

Again, not great; this would make “Mr.” the end of a sentence for instance. Notice again that I’m leaving the punctuation in the word. Perhaps not the best choice but it will work for our purposes.

Put it all together:

var builder = new MarkovBuilder<string>();
var sentences = File.ReadLines(“shakespeare.txt”)
.Words()
.Sentences();
foreach (var sentence in sentences)
{
for (int i = 0; i < sentence.Count  1; i += 1)
builder.AddTransition(sentence[i], sentence[i + 1]);
}
var markov = builder.ToDistribution();

Excellent. And now, release the Markovian monkey and we’ll see if Hamlet emerges:

Console.WriteLine(markov
.Samples()
.Take(100)
.Select(x => x.SpaceSeparated())
.NewlineSeparated());

Here’s a slightly-cleaned-up-for-legibility sample output; I’ve moved around the line breaks to make it easier to read.

SCENE I.

POET.

A very echo with him there. I say away!

HECTOR.

You know the body. Sir by my love him an end.
But your king? Sir John of a purse I grant
that thou brought forth in them with ears.
What’s the rebels with him. Well and others
orchards grew together.

BERTRAM.

Tis time his horse?

Dost thou seek how well but a shame this is
but with the Duke of Eve’s legacy be
the counter-gate which bears will stand
till he hath great world shall be a woman
with my ever is to cunning bade him
for recordation to death.

Exit GLOUCESTER

Re-enter Ghost.

CYMBELINE.

Grim-visag’d war but since thou turn the morn-dew
on this wretched soul elected him fair Helen
then let all our actors are you?

Now are both mine and said well his pitcher
by the heat of mine own desert?

What’s o’clock? I may not you do fight with you.
The mother breeds suspicion hath! Be soon shot.

Exeunt ACT III.

And I think I will leave it there.

It’s not Hamlet, to be sure, but it’s a start.

I also tried the complete works of Jane Austen:

The principal part of the result of your own line from any reply to the ecstasies of meeting he gravely but it in Hertfordshire and I mean to declare I have the days brought you must not. You must be happier than in Highbury with awful object as a greater steadiness had reached the picture of consequence to any objection to each other day came out.

Just… wow.

Next time on FAIC: We’ll turn our attention back to continuous distributions.

# Fixing Random, part 24

[Code for this episode is here.]

So far in this series we’ve very briefly looked at continuous distributions on doubles, and spent a lot of time looking at discrete distributions with small supports. Let’s take a look at a completely different kind of distribution that doesn’t easily fit into our “small discrete” bucket but isn’t continuous either. Some motivating scenarios:

• Imagine you find yourself placed at random in a city and you want to take a stroll. At every intersection you make a random choice of which direction to go next. The random choice of starting points has a particular distribution, and the distribution of possible direction choices could be different at each intersection. The result of each stroll is a path through the city. What is the distribution of possible strolls?
• Imagine you find yourself at a random Wikipedia/tvtropes/whatever page, and you want to go down that rabbit hole. On each page you randomly select a link to another page within the site (if there is one.) What is the distribution of possible sequences of pages?
• Imagine you are a gambler who starts with a certain random amount of cash in your pocket and a target for how much you want to win and how much you are willing to lose. If you’ve won or lost too much, you go home; otherwise, you make a random-sized bet on a slot machine with certain odds, where the distribution of bet sizes and machine choices is a function of how much money you have in your pocket right now. What is the distribution of possible sequences of cash in your pocket?

Each of these cases has a number of things in common:

• We start in a randomly chosen state.
• We randomly move to a new state depending solely on our current state, and not on any “memory” of how we got to that state.
• The new state might be a “stop state”, in which case the sequence of states is over.
• The question is: what is the distribution of possible state sequences?

A random process that has these properties is called a Markov process, after Russian mathematician Andrey Markov. Such processes are extremely useful and have been heavily studied for the last hundred years or so.

Aside: For this series I’m only going to look at Markov processes where each “state change” happens at no particular time; all we care about is the sequence. There is a whole subfield of studying Markov processes where we care about how long the gap is between events, and what the distribution of those gaps is; I’m not going to go there.

Aside: Note that there is no requirement in a Markov process that any of the distributions involved be discrete. The examples I gave all involved discrete quantities: intersections, web pages, dollar amounts. But you could certainly do a Markov process where you randomly chose a point on a line via some continuous distribution, and then chose the next point based on a continuous distribution that is a function of the current position. We will look at examples of such in later episodes; for now though we’ll stick to discrete examples.

Aside: Note that there is no requirement whatsoever that the sequence sampled from a Markov process be finite! It can be finite or infinite, depending on the details of the process.

Aside: Suppose we produce a sequence of dice rolls. We roll a die, and no matter what we rolled, we roll it again, every time. Technically that is a Markov process; the next roll depends on nothing, not even the previous roll. We will not be considering such “degenerate” Markov processes in this series; we’re interested in processes where the distribution of next states depends on the current state.

Let’s temporarily leave aside our discrete weighted distributions and go back to our underlying unweighted `IDistribution<T>` interface for this one. (Working out the “weight” associated with a particular sequence is an interesting problem, but not one that we’re going to cover in this series.) Here we have a distribution on sequences of `T`:

sealed class Markov<T> : IDistribution<IEnumerable<T>>
{
private readonly IDistribution<T> initial;
private readonly Func<T, IDistribution<T>> transition;
public static Markov<T> Distribution(
IDistribution<T> initial,
Func<T, IDistribution<T>> transition) =>
new Markov<T>(initial, transition);
private Markov(
IDistribution<T> initial,
Func<T, IDistribution<T>> transition)
{
this.initial = initial;
this.transition = transition;
}
public IEnumerable<T> Sample()
{
var current = this.initial;
while(true)
{
if (current is Empty<T>)
break;
var s = current.Sample();
yield return s;
current = this.transition(s);
}
}
}

Aside: You probably noticed that the state transition function is just another likelihood function: given the current state, it tells you what next states are most likely.

Let’s look at a simplified version of our “gambler’s ruin” example.

• Our state is the current amount of money we have: an integer.
• We need a distribution on initial conditions. Let’s say we’ll start off with exactly `n` dollars every time, so that’s a singleton distribution.
• If the state ever gets to zero or to twice `n`, we quit. We represent quitting by producing an empty distribution. We cannot produce any more states because we cannot sample from the empty distribution!
• If we are not in a quitting state then we flip a coin. This causes our state to transition to either the current bankroll minus one, or plus one.

const int n = 5;
var initial = Singleton<int>.Distribution(n);
IDistribution<int> transition(int x) =>
(x == 0 || x == 2 * n) ?
Empty<int>.Distribution :
Flip().Select(b => b ? x  1 : x + 1);

All right, let’s take it for a spin:

var markov = Markov<int>.Distribution(initial, transition);
Console.WriteLine(markov.Sample().CommaSeparated());

The first time I ran this I got:

```5,4,3,2,3,4,5,4,3,4,3,4,5,4,3,4,5,6,5,
6,5,6,5,6,5,6,5,4,5,6,7,8,9,8,9,8,9,10```

It took 38 flips to get from 5 to 10. (And we ended up back at the start nine times!)

Exercise: Should the length of this sequence surprise you? Is it longer than average, shorter than average, or does it seem like it should be a pretty typical sample from this distribution? The solution is below.

.

.

.

.

.

Let’s think about the possible lengths of sequences. A few things come to mind:

• The shortest possible games are 5, 4, 3, 2, 1, 0 and 5, 6, 7, 8, 9, 10, so we won’t expect to see anything below six.
• Every game must have an even number of items in the sequence; do you see why?
• The maximum length is… well, there is no maximum length. We could go 5, 6, 5, 6, 5, 6, … arbitrarily long before heading off to 10 or 0.

The easiest way to get a handle on typical sequence lengths to just draw the histogram:

markov.Samples().Select(x => x.Count()).DiscreteHistogram()

```  6|********************************
8|**************************************
10|****************************************
12|************************************
14|********************************
16|*****************************
18|****************************
20|**************************
22|***********************
24|*******************
26|*******************
28|******************
30|****************
32|***************
34|***********
36|**********
38|**********
40|********
42|********
44|********
46|******
48|******
50|******
52|****
54|****
56|***
58|****
60|***
62|**
64|**
66|**
68|***
70|**
72|*
```

I cut it off there; the “long tail” in this particular run actually went out to 208!

We can see from the histogram that the mode is ten flips and most games end in fewer than 38 flips, but there is still a substantial percentage that take longer. We could work out exactly how much, but I think I will leave it there for now. You get the idea.

We seem to be doing pretty well here; we’ve created an entire new class of distributions that we can explore using fluent programming in only a few lines of code.

Next time on FAIC: It takes only one Markovian monkey to produce an approximation of Hamlet.

# Fixing Random, part 23

So… I’ve got good news and bad news.

The good news is:

• I’ve described an interface for discrete probability distributions and implemented several distributions.
• I’ve shown how projecting a distribution is logically equivalent to the LINQ `Select` operator.
• I’ve shown how conditional probabilities of the form P(B|A) “probability of B given A” are likelihood functions.
• I’ve shown that combining a prior with a likelihood to form a joint distribution is the application of the monadic bind operator on the probability monad, and implemented this as `SelectMany`.
• I’ve shown that computing the posterior distribution from the joint distribution can be done by applying a `Where` clause.
• I’ve sketched out a statement-based workflow DSL that can be compiled down to our LINQ operators, and that lowering and then executing programs written in this DSL can compute a “sampling-loop-free” distribution automatically.
• And that distribution object can then be efficiently sampled.

This is all great, if I do say so myself. But the bad news is that there are some problems; let’s list them in order roughly from smallest to largest:

First problem: It’s often inconvenient to express weights as integers, particularly when you end up with largish coprime weights; we’re risking integer overflow when we multiply or add them, and we lack facilities for common operations like “make me a Bernoulli distribution based on this arbitrary double”.

That criticism is very valid, but the problem is totally fixable. I chose integer weights because it is easy to get integer arithmetic right in simple cases and we do not require any additional library support. A realistic implementation of a weighted discrete distribution would probably choose arbitrary-precision integers, arbitrary-precision rationals or doubles depending on its needs.

Using doubles slightly complicates the math because you either have to make sure that probabilities add up to 1.0 even when double representation error bites you, or you have to deal with situations where the total weight is not 1.0. The latter problem is not so bad; we can always divide through by the total weight if we need to. We’ll return to the more general problem of dealing with double-weighted distributions that do not add up to 1.0 later in this series.

Aside: Since probabilities are always between 0 and 1, and often very small, a common choice is to take the log of the probability as the weight. In that scheme we can implement multiplying two probabilities together as adding, which is nice, but lots of other operations get harder. This is also a nice representation if you’re doing math that involves computations of entropy, as entropy is defined as the negative log of the mass function. We won’t go there in this series.

Second problem: The basic idea of both our LINQ-based and statement-based implementations is to take a workflow that might contain conditions — `where` clauses, in the query form, `condition` statements in the statement form — and infer a distribution object that can be sampled from efficiently.

But our technique for that is for `SelectMany` to produce every possible outcome, and then group them to determine the weights! For example, suppose we had a workflow like

var d10 = SDU.Distribution(1, 10);
var sum = from a in d10
from b in d10
from c in d10
from d in d10
let s = a + b + c + d
where s >= 20
select s;

If these were sequences instead of distributions, we’d know what happens here: we enumerate all ten thousand possible combinations, run a filter over their sum, and then yield a whole sequence of sums that match the filter.

But that’s almost exactly what we’re doing with the distributions too! Our query here is basically a syntactic sugar for

var d10 = SDU.Distribution(1, 10);
var sum = from a in d10.Support()
from b in d10.Support()
from c in d10.Support()
from d in d10.Support()
let s = a + b + c + d
where s >= 20
group s …

and you know how this goes from our implementation of `SelectMany`on distributions: we weight each member, group them by outcome, sum up the weights, and so on. The math at the end isn’t the expensive part; the number of nested loops we go through to get to that math is!

Now imagine if this were a much more complex workflow, whether written in our query form or our statement form. Inferring the result of the workflow basically means evaluating every possible value for every local/range variable and recording the result. If there are lots of possible values for lots of variables, we could end up exploring millions or billions of possibilities, some of which are very low probability.

Now, the benefit here is of course that we only do that exploration once, and the probability distribution that comes out the other end is then cheap to use forever; we can memoize it, because by assumption everything is pure. But still, it seems odd that we’re implementing inference by running every possible path through the workflow and recording the results; it seems like we could trade off a small amount of inaccuracy for a great improvement in speed.

Put another way: in our original naive implementations of `Where` and `SelectMany` we didn’t do any inference; we just sampled the underlying distributions and lived with the fact that `Where` could loop a lot. Compare constructing the naive implementation to constructing our current implementation:

• Constructing the naive version of the distribution is O(n) in the size of the number of query operators. In this case, we’ve got a tiny handful of `SelectMany`s and a `Where`. The number of query operators is usually very small.
• Constructing the exact distribution is O(n) in the number of possible combinations of all variables, in this case, ten thousand. Those ten thousand computations then produce a correct weighted integer distribution.

And compare sampling:

• Sampling the naive version would ultimately do four samples from d10, add them up, and reject the resulting sum a little bit less than half of the time. On average we would do around seven or eight samples of underlying distributions per one sample of the sum distribution.
• Sampling the exact version is just sampling from a weighted integer, which we know does exactly two samples from its implementation distributions.

And compare knowledge of the weights:

• The naive version can compute weights cheaply for `Where` conditioned distributions, but not distributions that have `SelectMany` clauses; it’s computing the weights that is expensive.
• The exact version computes all weights exactly for all workflows.

Which way is the win? Consider these scenarios:

• We’re never going to call `Sample()`, but we want an accurate inference of the true weights of outcomes given a workflow. When we’re computing the posterior distribution of “is this email a spam?” we don’t care to ever sample from that distribution; we want to know is it 99% or 1%?
• We’re going to sample millions or billions of times, so keeping the per-sample cost as low as possible is important.
• We are sampling and a `Where` clause rejects the vast majority of possible outcomes, causing long waits for a sample in the naive case.

In those three scenarios, our current “exact but expensive inference” step is clearly better. But if those conditions are not met, then the current version is doing a lot of up-front work in order to save possibly very little work later.

Summing up:

• It is not immediately obvious where to draw the line that indicates whether the inference optimization’s high up-front cost is worth the benefit of cheap sampling later.
• It seems plausible that we could trade off a small amount of inaccuracy in the inference for large speed improvements in inference.

Third problem: the biggest problem with this system is: in practice it only works for discrete distributions with small supports!

I started this series by making a general-purpose distribution type for discrete or continuous distributions, waved my hands a bit about continuous distributions, and then abandoned them entirely. But they are important!

Moreover, we can’t easily work with discrete distributions where any integer is a possible output, like the geometric distribution or the Poisson distribution. Sure, we can approximate them by restricting their ranges to something reasonable, but it would be nice to be able to just have an object that represents a Poisson distribution and not have to worry about the repercussions of the fact that technically, the support is any positive integer.

Summing up:

• I’ve spent the last 22 episodes of this series solving the easiest possible problems in this space. All the operations we care about on “categorical distributions” admit easy exact solutions to the inference problem.
• If you take away one thing from this series, well, it should be that `System.Random` is way too simple for the needs of modern programmers dealing with stochastic systems. By making a couple simple interfaces and implementations, we can represent very rich probabilistic workflows, compose them, and do math automatically.
• If you take away two things, the second thing should be that sampling from a given distribution efficiently is a really hard problem. By restricting the problem enormously, we can make good progress on it, but once we leave the realm of tiny discrete distributions, we will be in some pretty deep water.

We’ve taken small discrete distributions as far as I’m going to go in this series. I’m going to take a short break to visit my family for Easter, and then we’ll pick up when I get back. I’ll spend the remainder of this series sketching out some ways we can use the ideas we’ve developed so far to tackle harder problems in stochastic programming, but will not go into as much depth as we have so far.

Next time on FAIC: How can we use these techniques to model Markov processes?