Fixing Random, part 35

Last time on FAIC we deduced the idea behind the “importance sampling” technique for determining the average value of a function from double to double — call it  f — when it is applied to samples from a possibly-non-normalized weighted distribution of doubles — call it p.

Just for fun, I’m going to do the derivation all over again. I’ll again be using the technique “things equal to the same are equal to each other”, but this time we’re going to start from the other end. Let’s jump right in!

Again, for pedagogical purposes I’m going to consider only distributions with support from 0.0 to 1.0; we’ll eliminate that restriction when we can.

We discovered a while back that the expected value of f applied to samples of p is equal to the quotient of Area(x => f(x) * p.Weight(x)) divided by Area(x => p.Weight(x)). The latter term is the normalization constant for p, which we might not know.

Let’s take any weighted distribution q also with support from 0.0 to 1.0; it also might not be normalized.

We’re now going to do some manipulations to our expression that are obviously identities. We’ll start with the fact that

Area(x => f(x) * p.Weight(x))

obviously must be equal to:

Area(x => (f(x) * p.Weight(x) / q.Weight(x)) * q.Weight(x))

We’ve just divided and multiplied by the same quantity, so that is no change. And we’ve assumed that q has the same support as p, so the weight we’re dividing by is always non-zero.

Similarly,

Area(p.Weight)

must be equal to

(Area(q.Weight) * (Area(p.Weight) / Area(q.Weight)))

for the same reason.

So the quotient of these two quantities must also be equal to the expected value of f applied to samples from p; they’re the same quantities! Our original expression

Area(x => f(x) * p.Weight(x)) /
  Area(x => p.Weight(x))

is equal to:

Area(x => (f(x) * p.Weight(x) / q.Weight(x)) * q.Weight(x)) / 
  (Area(q.Weight) * (Area(p.Weight) / Area(q.Weight)))

For any suitable q.

Let’s call that value exp_fp for “expected value of f on samples of p”. We’ve just written that value in two ways, one very straightforward, and one excessively complicated.

Unsurprisingly, my next question is: what is the expected value of function g

x => f(x) * p.Weight(x) / q.Weight(x)

over samples from q?

We know that it is estimated by this quotient of areas:

Area(x => g(x) * q.Weight(x)) /
Area(q.Weight)

The denominator is the normalization constant of q which we might not know.

Call that value exp_gq, for “expected value of g on samples of q”.

What is the relationship between exp_gq and exp_fp?

Well, just look at the two expressions carefully; plainly they differ by no more than a constant factor. exp_fp is equal to exp_gq * Area(q.Weight) / Area(p.Weight).
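
Since everything here is just areas, we can spot-check that claim numerically with the Area quadrature helper from a couple of episodes back. This is only a sketch; it assumes we have some particular f, p and q with support 0.0 to 1.0 already in hand:

double g(double x) => f(x) * p.Weight(x) / q.Weight(x);
double exp_fp = Area(x => f(x) * p.Weight(x)) / Area(p.Weight);
double exp_gq = Area(x => g(x) * q.Weight(x)) / Area(q.Weight);
// Up to quadrature error, these two lines print the same number:
Console.WriteLine(exp_fp);
Console.WriteLine(exp_gq * Area(q.Weight) / Area(p.Weight));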

And now we are right where we ended last time. Summing up:

  • Once again, we have deduced that importance sampling works:
    • An estimate of the expected value of g applied to samples from q is proportional to the expected value of f applied to samples from p
    • the proportionality constant is exactly the quotient of the normalization constants of q and p
    • If q and p are known to be normalized, then that constant is 1.
  • Once again, we can extend this result to q and p with any support
    • All we really need for an accurate estimate is that q have support in places where f(x) * p.Weight(x) has a lot of area.
    • But it is also nice if q has low weight where that function has small area.
  • Once again, we have two serious problems:
    • how do we find a good q?
    • we are required to know the normalization constants of both p and q, either of which might be hard to compute in general
  • Once again, the previous statement contains a subtle error.
    • I’m so devious.
    • I asked what the error was in the previous episode as well, and there is already an answer in the comments to that episode, so beware of spoilers if you want to try to figure it out for yourself.

We are so close to success, but it seems to be just out of our grasp. It’s vexing!


Next time on FAIC: Amazingly, we’ll implement a version of this algorithm entirely based on sampling; we do not actually need to do the quadrature to compute the normalization factors!

Fixing Random, part 34

Last time on FAIC we implemented a better technique for estimating the expected value of a function f applied to samples from a distribution p:

  • Compute the total area (including negative areas) under the function x => f(x) * p.Weight(x)
  • Compute the total area under x => p.Weight(x)
    • This is 1.0 for a normalized PDF, or the normalizing constant of a non-normalized PDF; if we already know it, we don’t have to compute it.
  • The quotient of these areas is the expected value

Essentially our technique was to use quadrature to get an approximate numerical solution to an integral calculus problem.

However, we also noted that it seems like there might still be room for improvement, in two main areas:

  • This technique only works when we have a good bound on the support of the distribution; for my contrived example I chose a “profit function” and a distribution where I said that I was only interested in the region from 0.0 to 1.0.
  • Our initial intuition that implementing an estimate of “average of many samples” by, you know, averaging many samples, seems like it was on the right track; can we get back there?

In this episode I’m going to stick to the restriction to distributions with support over 0.0 to 1.0 for pedagogic reasons, but our aim is to find a technique that gets us back to sampling over arbitrary distributions.

The argument that I’m going to make here (several times!) is: two things that are both equal to the same third thing are also equal to each other.

Recall that we arrived at our quadrature implementation by estimating that our continuous distribution’s expected value is close to the expected value of a very similar discrete distribution. I’m going to make my argument a little bit more general here by removing the assumption that p is a normalized distribution. That means that we’ll need to know the normalizing factor np, which as we’ve noted is Area(p.Weight).
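
In code, that normalizing factor is just one call to the quadrature helper from last time; a minimal sketch, assuming the same 0.0 to 1.0 support and the default bucket count:

// The normalizing factor is the total area under the weight function.
double np = Area(p.Weight);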

We said that we could estimate the expected value like this:

  • Imagine that we create a 1000 sided “unfair die” discrete distribution.
  • Each side corresponds to a 0.001 wide slice from the range 0.0 to 1.0; let’s say that we have a variable x that takes on values 0.000, 0.001, 0.002, and so on, corresponding to the 1000 sides.
  • The weight of each side is the probability of choosing this slice: p.Weight(x) / 1000 / np
  • The value of each side is the “profit function” f(x)
  • The expected value of “rolling this die” is the sum of (value times weight): the sum of f(x) * (p.Weight(x) / 1000 / np) over our thousand values of x

Here’s the trick:

  • Consider the standard continuous uniform distribution u. That’s a perfectly good distribution with support 0.0 to 1.0.
  • Consider the function w(x) which is x => f(x) * p.Weight(x) / np.  That’s a perfectly good function from double to double.
  • Question: What is an estimate of the expected value of w over samples from u?

We can use the same technique:

  • Imagine we create a 1000-sided “unfair die” discrete distribution
  • x is as before
  • The weight of each side is the probability of choosing that slice, but since this is a uniform distribution, every weight is the same — so, turns out, it is not an unfair die! The weight of each side is 0.001.
  •  The value of each side is our function w(x)
  • The expected value of rolling this die is the sum of (value times weight): the sum of (f(x) * p.Weight(x) / np) * 0.001 over our thousand values of x

But compare those two expressions; we are computing exactly the same sum both times. These two expected values must be the same value.
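
To make that concrete, here is the same sum written both ways in code; a sketch, assuming the f, p and np of our setup above:

// The "unfair die over p" estimate and the "uniform die over w" estimate
// are literally the same arithmetic.
var xs = Enumerable.Range(0, 1000).Select(i => (double)i / 1000);
double viaP = xs.Sum(x => f(x) * (p.Weight(x) / 1000 / np));
double viaU = xs.Sum(x => (f(x) * p.Weight(x) / np) * 0.001);
// viaP and viaU are the same expression, so they are the same value.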

Things equal to the same are equal to each other, which implies this conclusion:

If we can compute an estimate of the expected value of w applied to samples from u by any technique then we have also computed an estimate of the expected value of f applied to samples from p.

Why is this important?

The problem with the naïve algorithm in our original case was that there was a “black swan” — a region of large (negative) area that is sampled only one time in 1400 samples. But that region is sampled one time in about 14 samples when we sample from a uniform distribution, so we will get a much better and more consistent estimate of the expected value if we use the naïve technique over the uniform distribution.

In order to get 100 times as many samples in the black swan region, we do not have to take 100 times as many samples overall. We can just sample from a helper distribution that targets that important region more often.

Let’s try it! (Code can be found here.)

In order to not get confused here, I’m going to rename some of our methods so that they’re not all called ExpectedValue. The one that just takes any distribution and averages a bunch of samples is now ExpectedValueBySampling and the one that computes two areas and takes their quotient is ExpectedValueByQuadrature.

var p = Normal.Distribution(0.75, 0.09);
double f(double x) => Atan(1000 * (x - .45)) * 20 - 31.2;
var u = StandardContinuousUniform.Distribution;
double np = 1.0; // p is normalized
double w(double x) => f(x) * p.Weight(x) / np;
for (int i = 0; i < 100; ++i)
  Console.WriteLine($"{u.ExpectedValueBySampling(w):0.###}");

Remember, the correct answer that we computed by quadrature is 0.113. When sampling p directly we got values ranging from 0.07 to 0.14. But now we get:

0.114, 0.109, 0.109, 0.118, 0.111, 0.107, 0.113, 0.112, ...

So much better!

This is awesome, but wait, it gets more awesome. What is so special about the uniform distribution? Nothing, that’s what. I’m going to do this argument one more time:

Suppose I have distribution q, any distribution whatsoever, so long as its support is the same as p — in this case, 0.0 to 1.0. In particular, suppose that q is not necessarily a normalized distribution, but that again, we know its normalization factor. Call it nq.  Recall that the normalization factor can be computed by nq = Area(q.Weight).

Our special function g(x) is this oddity:

x => f(x) * (p.Weight(x) / q.Weight(x)) * (nq / np)

What is the expected value of g over distribution q?  One more time:

  • Create a 1000-sided unfair die, x as before.
  • The weight of each side is the probability of choosing that side, which is (q.Weight(x) / 1000) / nq
  • The value of each side is g(x).
  • The expected value is the sum of g(x) * (q.Weight(x) / 1000) / nq, but if you work that out, of course that is the sum of f(x) * p.Weight(x) / np / 1000

And once again, we’ve gotten back to the same sum by clever choice of function. If we can compute the expected value of g evaluated on samples from q, then we know the expected value of f evaluated on samples from p!
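
In other words, once we know both normalization constants, the whole estimate is just one call to our sampling-based helper. This is only a sketch of the idea, assuming the p, q, f, np and nq described above:

// Importance sampling in two lines, assuming we already know np and nq.
double g(double x) => f(x) * (p.Weight(x) / q.Weight(x)) * (nq / np);
double estimate = q.ExpectedValueBySampling(g);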

This means that we can choose our helper distribution q so that it is highly likely to pick values in regions we consider important. Let’s look at our graph of p.Weight and f*p.Weight again:

[Graph: p.Weight(x) in blue and f(x) * p.Weight(x) in orange]

There are segments of the graph where the area under the blue line is very small but the area under the orange line is large, and that’s our black swan; what we want is a distribution that samples from regions where the orange area is large, and if possible skips regions where it is small. That is, we consider the large-area regions important contributors to the expected value, and the small-area regions unimportant contributors; we wish to target our samples so that no important regions are ignored. That’s why this technique for computing expected value is called “importance sampling”.


Exercise: The uniform distribution is pretty good on the key requirement that it never be small when the area under the orange line is large, because it is always the same size from 0.0 to 1.0; that’s why it is the uniform distribution, after all. It’s not great on our second measure; it spends around 30% of its time sampling from regions where the orange line has essentially no area under it.

Write some code that tries different distributions. For example, implement the distribution that has weight function x => (0 <= x && x <= 1) ? x : 0

(Remember that this is not a normalized distribution, so you’ll have to compute nq.)

Does that give us an even better estimate of the expected value of f?

Something to ponder while you’re working on that: what would be the ideal choice of distribution?
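
If you want a nudge in the right direction for that exercise, here is one way you might sketch the helper distribution. The class name and the inverse-CDF sampling trick are my own illustration, assuming the IWeightedDistribution<double> interface from earlier in the series exposes just Sample and Weight:

// A non-normalized distribution with weight x on [0, 1] and zero weight
// elsewhere. Sampling uses the inverse CDF of the normalized version,
// which is the square root of a standard uniform sample.
sealed class Ramp : IWeightedDistribution<double>
{
  public static readonly Ramp Distribution = new Ramp();
  private Ramp() { }
  public double Weight(double x) => (0 <= x && x <= 1) ? x : 0.0;
  public double Sample() =>
    Math.Sqrt(StandardContinuousUniform.Distribution.Sample());
}
// For this weight function, nq = Area(q.Weight) = 0.5.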


Summing up:

  • Suppose we have a weighted distribution of doubles p and a function from double to double f.
  • We wish to accurately estimate the average value of f when it is applied to a large set of samples from p; this is the expected value.
  • However, there may be “black swan” regions where the value of f is important to the average, but the probability of sampling from that region is low, so our average could require a huge number of samples to get an accurate average.
  • We can fix the problem by choosing any weighted distribution q that has the same support as p but is more likely to sample from important regions.
  • The expected value of g (given above) over samples drawn from q is the same as the expected value of f over samples from p.

This is great, and I don’t know if you noticed, but I removed any restriction there that p or q be distributions only on 0.0 to 1.0; this technique works for weighted distributions of doubles over any support!


Aside: We can weaken our restriction that q have the same support as p; if we decide that q can have zero weight for particularly unimportant regions, where, say, we know that f(x)*p.Weight(x) is very small, then that’s still going to produce a good estimate.

Aside: Something I probably should have mentioned before is that all of the techniques I’m describing in this series for estimating expected values require that the expected value exists! Not all functions applied to probability distributions have an expected value because the average value of the function computed on a group of samples might not converge as the size of the group gets larger. An easy example is, suppose we have a standard normal distribution as our p​ and x => 1.0 / p.Weight(x) as our f. The more samples from p we take, the more likely it is that the average value of f gets larger!


However, it’s not all sunshine and roses. We still have two problems, and they’re pretty big ones:

  • How do we find a good-quality q distribution?
  • We need to know the normalization constants for both distributions. If we do not know them ahead of time (because, say, we have special knowledge that the continuous uniform distribution is normalized) then how are we going to compute them? Area(p.Weight) or Area(q.Weight) might be expensive or difficult to compute. It seems like in the general case we still have to solve the calculus problem. 😦

Aside: The boldface sentence in my last bullet point contains a small but important error. What is it? Leave your guesses in the comments; the answer will be in an upcoming episode.


I’m not going to implement a general-purpose importance sampling algorithm until we’ve made at least some headway on these remaining problems.

Next time on FAIC:  It’s Groundhog Day! I’m going to do this entire episode over again; we’re going to make a similar argument — things equal to the same are equal to each other — but starting from a different place. We’ll end up with the same result, and deduce that importance sampling works.

 

Fixing Random, part 33

Last time on FAIC I showed why our naïve implementation of computing the expected value can be fatally flawed: there could be a “black swan” region where the “profit” function f is different enough to make a big difference in the average, but the region is just small enough to be missed sometimes when we’re sampling from our distribution p

It looks so harmless [Photo credits]

The obvious solution is to work harder, not smarter: just do more random samples when we’re taking the average! But doesn’t it seem to be a little wasteful to be running ten thousand samples in order to get 9990 results that are mostly redundant and 10 results that are extremely relevant outliers?

Perhaps we can be smarter.

We know how to compute the expected value in a discrete non-uniform distribution of doubles: multiply each value by its weight, sum those, and divide by the total weight. But we should think for a moment about why that works.

If we have an unfair two-sided die — a coin — stamped with value 1.23 on one side, and -5.87 on the other, and 1.23 is twice as likely as -5.87, then that is the same as a fair three sided die — whatever that looks like — with 1.23 on two sides and -5.87 on the other. One third of the time we get the first 1.23, one third of the time we get the second 1.23, and one third of the time we get -5.87, so the expected value is 1.23/3 + 1.23/3 – 5.87/3, and that’s equivalent to (2 * 1.23 – 5.87) / 3. This justifies our algorithm.
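
If you prefer to see that argument as code, here it is; this is just the arithmetic from the paragraph above, not anything from the series’ library:

// Weighted average over the unfair coin, and plain average over the
// equivalent fair three-sided die; both work out to about -1.137.
double unfairCoin = (2.0 * 1.23 + 1.0 * (-5.87)) / 3.0;
double fairThreeSidedDie = (1.23 + 1.23 + (-5.87)) / 3.0;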

Can we use this insight to get a better estimate of the expected value in the continuous case? What if we thought about our continuous distribution as just a special kind of discrete distribution?


Aside: In the presentation of discrete distributions I made in this series, of course we had integer weights. But that doesn’t make any difference in the computation of expected value; we can still multiply values by weights, take the sum, and divide by the total weight.


Our original PDF — that shifted, truncated bell curve — has support from 0.0 to 1.0. Let’s suppose that instead we have an unfair 1000-sided die, by dividing up the range into 1000 slices of size 0.001 each.

  • The weight of each side of our unfair die is the probability of rolling that side.
    • Since we have a normalized PDF, the probability is the area of that slice. 
    • Since the slices are very thin, we can ignore the fact that the top of the shape is not “level”; let’s just treat it as a rectangle.
    • The width is 0.001; the height is the value of the PDF at that point.
    • That gives us enough information to compute the area.
  • Since we have a normalized PDF, the total weight that we have to divide through is 1.0, so we can ignore it. Dividing by 1.0 is an identity.
  • The value on each side of the die is the value of our profit function at that point.

Now we have enough information to make an estimate of the expected value using our technique for discrete distributions.


Aside: Had I made our discrete distributions take double weights instead of integer weights, at this point I could simply implement a “discretize this distribution into 1000 buckets” operation that turns weighted continuous distributions into weighted discrete distributions.

However, I don’t really regret making the simplifying choice to go with integer weights early in this series; we’re immediately going to refactor this code away anyways, so turning it into a discrete distribution would have been a distraction.


Let’s write the code: (Code for this episode is here.)

public static double ExpectedValue(
    this IWeightedDistribution<double> p,
    Func<double, double> f) =>
  // Let’s make a 1000 sided die:
  Enumerable.Range(0, 1000)
  // … from 0.0 to 1.0:
  .Select(i => ((double)i) / 1000)
  // The value on the "face of the die" is f(x).
  // The weight of that face is the probability of choosing this slot:
  .Select(x => f(x) * p.Weight(x) / 1000)
  .Sum();
  // No need to divide by the total weight since it is 1.0.

And if we run that:

Console.WriteLine($"{p.ExpectedValue(f):0.###}");

we get

0.113

which is a close approximation of the true expected value of this profit function over this distribution. Total success, finally!

Or, maybe not.

I mean, that answer is correct, so that’s good, but we haven’t solved the problem in general.

The obvious problem with this implementation is: it only works on normalized distributions whose support is between 0.0 and 1.0. Also, it assumes that 1000 is a magic number that always works. It would be nice if this worked on non-normalized distributions over any range with any number of buckets.

Fortunately, we can solve these problems by making our implementation only slightly more complicated:

public static double ExpectedValue(
  this IWeightedDistribution<double> p,
  Func<double, double> f,
  double start = 0.0,
  double end = 1.0,
  int buckets = 1000)
{
  double sum = 0.0;
  double total = 0.0;
  for (int i = 0; i < buckets; i += 1)
  {
    double x = start + (end - start) * i / buckets;
    double w = p.Weight(x) / buckets;
    sum += f(x) * w;
    total += w;
  }
  return sum / total;
}

That totally works, but take a closer look at what this is really doing. We’re computing two sums, sum and total, in exactly the same manner. Let’s make this a bit more elegant by extracting out the summation into its own method:

public static double Area(
    Func<double, double> f,
    double start = 0.0,
    double end = 1.0,
    int buckets = 1000) =>
  Enumerable.Range(0, buckets)
    .Select(i => start + (end - start) * i / buckets)
    .Select(x => f(x) / buckets)
    .Sum();
public static double ExpectedValue(
    this IWeightedDistribution<double> p,
    Func<double, double> f,
    double start = 0.0,
    double end = 1.0,
    int buckets = 1000) =>
  Area(x => f(x) * p.Weight(x), start, end, buckets) /
    Area(p.Weight, start, end, buckets);

As often happens, by making code more elegant, we gain insights into the meaning of the code. That first function should look awfully familiar, and I’ve renamed it for a reason. The first helper function computes an approximation of the area under a curve; the second one computes the expected value as the quotient of two areas. 

It might be easier to understand it with a graph; here I’ve graphed the distribution p.Weight(x) as the blue line and the profit times the distribution f(x) * p.Weight(x) as the orange line:

[Graph: p.Weight(x) in blue and f(x) * p.Weight(x) in orange]

The total area under the blue curve is 1.0; this is a normalized distribution.

The orange curve is the blue curve multiplied by the profit function at that point. The total area under the orange curve — remembering that area below the zero line is negative area — divided by the area of the blue curve (1.0) is the expected value.


Aside: You can see from the graph how carefully I had to contrive this “black swan” scenario. I needed a region of the graph where the area under the blue line is close to 0.001 and the profit function is so negative that it makes a large negative area there when multiplied, but without making a large negative area anywhere else.

Of course this example is contrived, but it is not unrealistic; unlikely things happen all the time, and sometimes those unlikely things have important consequences.

An interesting feature of this scenario is: look at how wide the negative region is! It looks like it is around 10% of the total support of the distribution; the problem is that we sample from this range only 0.1% of the time because the blue line is so low here. We’ll return to this point in the next episode.


Aside: The day I wrote this article I learned that this concept of computing expected value of a function applied to a distribution by computing the area of the product has a delightful name: LOTUS, which stands for the Law Of The Unconscious Statistician.

The tongue-in-cheek name is apparently because statistics students frequently believe that “the expected value is the area under this curve” is the definition of expected value. I hope I avoided any possible accusation of falling prey to the LOTUS fallacy. We started with a correct definition of expected value: the average value of a function applied to a bunch of samples, as the size of the bunch gets large. I then gave an admittedly unrigorous, informal and hand-wavy justification for computing it by approximating area, but it was an argument.


We’ve now got two ways of computing an approximation of the expected value when given a distribution and a function:

  • Compute a bunch of samples and take their average.
  • Compute approximate values of two areas and divide them.

As we know, the first has problems: we might need a very large set of samples to find all the relevant “black swan” events, and therefore we spend most of our time sampling the same boring high-probability region over and over.

However, the second has some problems too:

  • We need to know the support of the distribution; in my contrived example I chose a distribution over 0.0 to 1.0, but of course many distributions have much larger ranges.
  • We need to make a guess about the appropriate number of buckets to get an accurate answer.
  • We are doing a lot of seemingly unnecessary math; between 0.0 and, say, 0.3, the contribution of both the blue and orange curves to the total area is basically zero. Seems like we could have skipped that, but then again, skipping a region with total probability of one-in-a-thousand led to a bad result before, so it’s not entirely clear when it is possible to save on work.
  • Our first algorithm was fundamentally about sampling, which seems appropriate, since “average of a set of samples” is the definition of expected value. This algorithm is just doing an approximation of integral calculus; something seems “off” about that.

It seems like we ought to be able to find an algorithm for computing expected value that is more accurate than our naive algorithm, but does not rely so much on calculus.


Next time on FAIC: We’ll keep working on this problem!

Fixing Random, part 32

Last time on FAIC we reviewed the meaning of “expected value”: when you get a whole bunch of samples from a distribution, and a function on those samples, what is the average value of the function’s value as the number of samples gets large?

I gave a naive implementation:

public static double ExpectedValue<T>(
    this IDistribution<T> d,
    Func<T, double> f) =>
  d.Samples().Take(1000).Select(f).Average();

public static double ExpectedValue(
    this IDistribution<double> d) =>
  d.ExpectedValue(x=>x);

Though short and sweet, this implementation has some problems; the most obvious one is that hard-coded 1000 in there; where did it come from? Nowhere in particular, that’s where.

It seems highly likely that this is either too big, and we can compute a good estimate in fewer samples, or that it is too small, and we’re missing some important values.

Let’s explore a scenario where the number of samples is too small. The scenario will be contrived, but not entirely unrealistic.

Let’s suppose we have an investment strategy; we invest a certain amount of money with this strategy, and when the strategy is complete, we have either more or less money than we started with. To simplify things, let’s say that the “outcome” of this strategy is just a number between 0.0 and 1.0; 0.0 indicates that the strategy has completely failed, resulting in a loss, and 1.0 indicates that the strategy has completely succeeded, resulting in the maximum possible return.

Before we go on, I want to talk a bit about that “resulting in a loss” part. If you’re a normal, sensible investor like me and you have $100 to invest, you buy a stock or a mutual fund for $100 because you believe it will increase in value. If it goes up to $110 you sell it and pocket the $10 profit. If it goes down to $90, you sell it and take the $10 loss. But in no case do you ever lose more than the $100 you invested. (Though of course you do pay fees on each trade whether it goes up or down; let’s suppose those are negligible.) Our goal is to get that 10% return on investment.

Now consider the following much riskier strategy for spending $100 to speculate in the market: suppose there is a stock currently at $100 which we believe will go down, not up. We borrow a hundred shares from someone willing to lend them to us, and sell them for $10000. We pay the lender $100 interest for their trouble. Now if the stock goes down to $90, we buy the hundred shares back for $9000, return them to the owner, and we’ve spent $100 but received $1000; we’ve made a 900% return on the $100 we “invested”. This is a “short sale”, and as you can see, you get a 900% return instead of a 10% return.

(I say “invested” in scare quotes because this isn’t really investing; it’s speculation. Which is a genteel word for “gambling”.)

But perhaps you also see the danger here. Suppose the stock goes down but only to $99.50. We buy back the shares for $9950, and we’ve only gained $50 on the trade, so we’ve lost half of our $100; in the “long” strategy, we would only have lost fifty cents; your losses can easily be bigger with the short strategy than with the long strategy.

But worse, what if the stock goes up to $110? We have to buy back those shares for $11000, so we’ve “invested” $100 and gotten a “return” of -$1000, for a total loss of $1100 on a $100 investment. In a short sale, if things go catastrophically wrong you can end up losing more than your original investment. A lot more!

As I often say, foreshadowing is the sign of a quality blog; let’s continue with our example.

We have a process that produces values from 0.0 to 1.0 that indicates the “success level” of the strategy. What is the distribution of success level? Let’s suppose it’s a straightforward bell-shaped curve with the mean on the “success” side:

[Graph: the bell-shaped PDF of the strategy’s outcome distribution]

This is a PDF, and extra bonus, it is normalized: the total area under the curve is 1.0. The vast majority of the time — 99.9% — our investment strategy produces an outcome between 0.48 and 1.0, so that’s the area of the graph I’m going to focus on.


Aside: I know it looks a little weird seeing a bell-shaped curve that just cuts off abruptly at 1.0, but let’s not stress about it. Again, this is a contrived example, and we’ll see why it is irrelevant later.


Now, I don’t intend to imply that the “success level” is the return: I’m not saying that most of the time we get a 75% return. Let’s suppose that we’ve worked out a function that translates “strategy outcome level” to the percentage gain on our investment. That is, if the function is 0.0, we’ve not gained anything but we’ve not lost anything either. If it is 0.5 then our profits are 50% of our investment; if it’s -0.25 then we’ve taken a net loss of 25% of our investment, and so on.

Let’s propose such a function, and zoom in on the right side of the diagram where the 99.9% of cases are found: from 0.46 to 1.0. Here’s our function:

[Graph: the profit function for outcomes from 0.46 to 1.0]

If our process produces an outcome of 0.54 or larger, we make between a 0% and an 18% return, which seems pretty good; after all the vast majority of the bulk of the distribution is to the right of 0.54. Now, in the unlikely case where we get an outcome from 0.48 to 0.54, we lose money; on the left hand end, we’re losing almost 200%; that is, if we put in $100 we not only fail to make it back, we end up owing $100.

The question at hand, of course, is what is the average value of the “profit function” given the distribution of outcomes? That is, if we ran this strategy a thousand times, on average what return would we get?

Well, we already wrote the code, so let’s run it: (Code can be found here.)

var p = Normal.Distribution(0.75, 0.09);
double f(double x) =>
  Atan(1000 * (x - .45)) * 20 - 31.2;
Console.WriteLine($"{p.ExpectedValue(f):0.##}");


Aside: I’m ignoring the fact that if we get a sample out of the normal distribution that is greater than 1.0 or less than 0.0 we do not discard it. The region to the left of 0.0 is absurdly unlikely, and the region to the right of 1.0 has low probability and the value of the profit function is very flat. Writing a lot of fiddly code to correct for this problem would not change the estimated expected value much and would be a distraction from the far deeper problem we’re about to discover.

Also, I’m limiting the distribution to a support of 0.0 to 1.0 for pedagogical reasons; as we’ll see in later episodes, we will eventually remove this restriction, so let’s just ignore it now.


We get the answer:

0.14

Super, we will get on average a 14% return on our investment with this strategy. Seems like a good investment; though we have a small chance of losing big, the much larger chance of getting a solid 0% to 18% return outweighs it. On average.


Aside: It is important to realize that this 14% figure does not necessarily imply that typically we get a 14% return when we run this strategy, any more than the expected value of 3.5 implies that we’ll typically get 3.5 when we roll a d6. The expected value only tells you what the average behaviour is; it doesn’t tell you, for example, what the probability is that you’ll lose everything. It tells you that if you ran this strategy a million times, on average you’d get a 14% return, and that’s all it tells you. Be careful when making decisions relying upon expected values!


This 14% expected result totally makes sense, right? We know most of the time the result will be between 0% and 18%, so 14% seems like a totally reasonable expected value.

As one of my managers at Microsoft used to say: would you bet your car on it?

Let’s just run the code again to be sure.

0.14

Whew. That’s a relief.

You know what they say: doing the same thing twice and expecting different results is the definition of practicing the piano, but I’m still not satisfied. Let’s run it another hundred times:

0.14, 0.08, 0.09, 0.14, 0.08, 0.14, 0.14, 0.07, 0.07, 0.14, ...

Oh dear me.

The expected return of our strategy is usually 14%; why is it sometimes half that? Yes, there is randomness in our algorithm, but surely it should not cause the computed value to vary so widely!

It seems like we cannot rely on this implementation, but why not? The code looks right.

As it turns out, I showed you exactly the wrong part of the graphs above. (Because I am devious.) I showed you the parts that make up 99.9% of the likely cases. Let’s look at the graph of the profit-or-loss function over the whole range of outcomes from 0.0 to 1.0:

[Graph: the profit-or-loss function over the whole range of outcomes from 0.0 to 1.0]

OUCH. It’s an approximation of a step function; in the original graph that started at 0.46, we should have noticed that the extremely steep part of the curve was trending downwards to the left at an alarming rate. If you have an outcome of 0.48 from our process, you’re going to lose half your money; 0.46, you lose all your money and a little bit more. But by the time we get to 0.44, we’re down to losing over 60 times the original investment; this is a catastrophic outcome.

How likely is that catastrophic outcome? Let’s zoom in on just that part of the probability distribution. I’ll re-scale the profit function so that they fit on the same graph, so you can see where exactly the step happens:

[Graph: the PDF near the black swan region, with the re-scaled profit function overlaid]

This is a normalized PDF, so the probability of getting a particular outcome is equal to the area of the region under the curve. The region under the curve from 0.38 to 0.45 is a little less than a triangle with base 0.07 and height 0.02, so its area is in the ballpark of 0.0007. We will generate a sample in that region roughly once every 1400 samples.
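
We could check that estimate empirically just by counting samples; a quick sketch, assuming the same p as above and the Samples helper from earlier in the series:

// Count how many of a million samples land in the black swan region;
// the fraction approximates the area of that slice of the PDF.
int hits = p.Samples()
  .Take(1000000)
  .Count(x => 0.38 <= x && x <= 0.45);
Console.WriteLine((double)hits / 1000000);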

But… we’re computing the expected value by taking only a thousand samples, so it is quite likely that we don’t hit this region at all. The expected losses in that tiny area are so enormous that they change the average every time we do sample from it: a single value of roughly -60 included in an average of a thousand samples shifts that average by about 0.06, which is roughly the difference between computing 0.14 and computing 0.07.


Aside: if you run the expected value estimate a few thousand times or more it just gets crazier; sometimes the expected value is actually negative if we just by luck get more than one sample in this critical region.


I contrived this example to produce occasional “black swan events”; obviously, our naïve algorithm for computing expected value does not take into account the difficulty of sampling a black swan event, and therefore produces inaccurate and inconsistent results.


Next time on FAIC: We know how to solve this problem exactly for discrete distributions; can we use some of those ideas to improve our estimate of the expected value in this scenario?

Fixing Random, part 31

Last time in this series we saw that we could compute a continuous posterior distribution when given a continuous prior and a discrete likelihood function; I hope it is clear how that is useful, but I’d like to switch gears for a moment and look at a different (but also extremely useful) computation: the expected value.

I’ll start with a quick refresher on how to compute the expected value of a discrete distribution.

You probably already know what expected value of a discrete distribution is; we’ve seen it before in this series. But in case you don’t recall, the basic idea is: supposing we have a distribution of values of a type where we can meaningfully take an average; the “expected value” is the average value of a set of samples as the number of samples gets very large.

A simple example is: what’s the expected value of rolling a standard, fair six-sided die? You could compute it empirically by rolling 6000d6 and dividing by 6000, but that would take a while.


Aside: Again, recall that in Dungeons and Dragons, XdY is “roll a fair Y-sided die X times and take the sum”.


We could also compute this without doing any rolling; we’d expect that about 1000 of those rolls would be 1, 1000 would be 2, and so on.  So we should add up (1000 + 2000 + … + 6000) / 6000, which is just (1 + 2 + 3 + 4 + 5 + 6) / 6, which is 3.5. On average when you roll a fair six-sided die, you get 3.5.

We can then do variations on this scenario; what if the die is still fair, but the labels are not 1, 2, 3, 4, 5, 6, but instead -9, -1, 0, 1, 3, 30?  As I’m sure you can imagine, once again we can compute the expected value by just taking the average: (-9 – 1 + 0 + 1 + 3 + 30) / 6 = 4.


Aside: It’s a bit weird that the “expected value” of a distribution is a value that is not even in the support of the distribution; I’ve never once rolled a 3.5 on a d6. Beginners sometimes confuse the expected value with the mode: that is, the value that you’d expect to get more often than any other value. Remember, the expected value is just an average; it only makes sense in distributions where the sampled values can be averaged.


What if the die isn’t fair? In that case we can compute a weighted average: multiply the value of each side by the weight of that side, sum those products, and divide by the total weight. As we saw in a previous episode:

public static double ExpectedValue(
    this IDiscreteDistribution<int> d) =>
  d.Support()
   .Select(s => 
     (double)s * d.Weight(s)).Sum() / d.TotalWeight();

And of course we could similarly define methods for discrete distributions of double and so on. Hopefully that is all clear.
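
For instance, the version for discrete distributions of doubles has exactly the same shape; a sketch, assuming the same interface members as above:

// The analogous computation for a discrete distribution of doubles.
public static double ExpectedValue(
    this IDiscreteDistribution<double> d) =>
  d.Support()
   .Select(s => s * d.Weight(s))
   .Sum() / d.TotalWeight();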

The question I want to explore in the next few episodes requires us to make a small extension of the meaning of “expected value”:

  • Suppose we have a distribution of outcomes d, in the form of an IWeightedDistribution<double>
  • Suppose we have a function f from double to double which assigns a value to each outcome.
  • We wish to accurately and efficiently estimate the average value of f(d.Sample()) as the number of samples becomes large.

Aside: If we had implemented Select on weighted distributions, that would be the same as the expected value of d.Select(f) — but we didn’t!

Aside: We could be slightly more general and say that the distribution is on any T, and f is a function from T to double, but for simplicity’s sake we’ll stick to continuous, one-dimensional distributions in this series. At least for now.


There’s an obvious and extremely concise solution; if we want the average as the number of samples gets large, just compute the average of a large number of samples! It’s like one line of code. Since we make no use of any weights, we can take any distribution:

public static double ExpectedValue(
    this IDistribution<double> d,
    Func<double, double> f) =>
  d.Samples().Take(1000).Select(f).Average();

In fact, we could make this even more general; we only need to get a number out of the function:

public static double ExpectedValue<T>(
    this IDistribution<T> d,
    Func<T, double> f) =>
  d.Samples().Take(1000).Select(f).Average();

We could also make it more specific, for the common case where the function is an identity:

public static double ExpectedValue(
    this IDistribution<double> d) =>
  d.ExpectedValue(x => x);

Let’s look at a couple of examples. (Code for this episode can be found here.) Suppose we have a distribution from 0.0 to 1.0, say the beta distribution from last time, but we’ll skew it a bit:

var distribution = Beta.Distribution(2, 5);
Console.WriteLine(distribution.Histogram(0, 1));

      ****                              
     *******                            
    *********                           
    *********                           
   ***********                          
   ************                         
   *************                        
   **************                       
  ***************                       
  ****************                      
  *****************                     
  ******************                    
 *******************                    
 ********************                   
 **********************                 
 ***********************                
 ************************               
***************************             
*****************************           
----------------------------------------

It looks like we’ve heavily biased these coins towards flipping tails; suppose we draw a coin from this mint; what is the average fairness of the coins we draw? We can just draw a thousand of them and take the average to get an estimate of the expected value:

Console.WriteLine(distribution.ExpectedValue());

0.28099740981762

That is, we expect that a coin drawn from this mint will come up heads about 28% of the time and tails 72% of the time, which conforms to our intuition that this mint produces coins that are heavily weighted towards tails.
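
(For reference, the exact expected value of a Beta(2, 5) distribution is 2 / (2 + 5) ≈ 0.286, so our thousand-sample estimate is in the right ballpark.)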

Or, here’s an idea; remember the distribution we determined last time: the posterior distribution of fairness of a coin drawn from a Beta(5, 5) mint, flipped once, that turned up heads. On average, what is the fairness of such a coin? (Remember, this is the average given that we’ve discarded all the coins that came up tails on their first flip.)

var prior = Beta.Distribution(5, 5);
IWeightedDistribution<Result> likelihood(double d) =>
  Flip<Result>.Distribution(Heads, Tails, d);
var posterior = prior.Posterior(likelihood)(Heads);
Console.WriteLine(posterior.ExpectedValue());

0.55313807698807

As we’d expect, if we draw a coin from this mint, flip it once, and it comes up heads, on average if we did this scenario a lot of times, the coins would be biased to about 55% heads to 45% tails.

So, once again we’ve implemented a powerful tool in a single line of code! That’s awesome.

Right?

I hope?

Unfortunately, this naive implementation has a number of problems.


Exercise: What are the potential problems with this implementation? List them in the comments!


Next time on FAIC: We’ll start to address some of the problems with this naive implementation.

Fixing Random, part 30

Last time on FAIC I posed and solved a problem in Bayesian reasoning involving only discrete distributions, and then proposed a variation on the problem whereby we change the prior distribution to a continuous distribution, while preserving that the likelihood function produced a discrete distribution.

The question is: what is the continuous posterior when we are given an observation of the discrete result?

More specifically, the problem I gave was: suppose we have a prior in the form of a process which produces random values between 0 and 1. We sample from that process and produce a coin that is heads with the given probability. We flip the coin; it comes up heads. What is the posterior distribution of coin probabilities?

I find this question a bit hard to wrap my head around. Here’s one way to think about it: Suppose we stamp the probability of the coin coming up heads onto the coin. We mint and then flip a million of those coins once each. We discard all the coins that came up tails. The question is: what is the distribution of probabilities stamped on the coins that came up heads?

Let’s remind ourselves of Bayes’ Theorem. For prior P(A) and likelihood function P(B|A), that the posterior is:

P(A|B) = P(A) P(B|A) / P(B)

Remembering of course that P(A|B) is logically a function that takes a B and returns a distribution of A and similarly for P(B|A).

But so far we’ve only seen examples of Bayes’ Theorem for discrete distributions. Fortunately, it turns out that we can do almost the exact same arithmetic in our weighted non-normalized distributions and get the correct result.


Aside: A formal presentation of the continuous version of Bayes’ Theorem and a proof that it is correct would require some calculus and would distract from where I want to go in this episode, so I’m just going to wave my hands here. Rest assured that we could put this on a solid theoretical foundation if we chose to.


Let’s think about this in terms of our type system. If we have a prior:

IWeightedDistribution<double> prior = whatever;

and a likelihood:

Func<double, IWeightedDistribution<Result>> likelihood = whatever;

then what we want is a function:

Func<Result, IWeightedDistribution<double>> posterior = ???

Let’s suppose our result is Heads. The question is: what is posterior(Heads).Weight(d) equal to, for any d we care to choose? We just apply Bayes’ Theorem on the weights. That is, this expression should be equal to:

prior.Weight(d) * likelihood(d).Weight(Heads) / ???.Weight(Heads)

We have a problem; we do not have an IWeightedDistribution<Result> to get Weight(Heads) from to divide through. That is: we need to know what the probability is of getting heads if we sample a coin from the mint, flip it, and do not discard heads.

We could estimate it by repeated computation. We could call

likelihood(prior.Sample()).Sample()

a billion times; the fraction of them that are Heads is the weight of Heads overall.
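
That brute-force estimate would look something like this sketch, assuming the prior and likelihood declared above; a million samples here rather than a billion, but the idea is the same:

// Estimate the overall weight of Heads: sample a coin from the prior,
// flip it once, and count how often it comes up heads.
int n = 1000000;
int heads = Enumerable.Range(0, n)
  .Count(_ => likelihood(prior.Sample()).Sample() == Heads);
double weightOfHeads = (double)heads / n;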

That sounds expensive though. Let’s give this some more thought.

Whatever the Weight(Heads) is, it is a positive constant, right? And we have already abandoned the requirement that weights have to be normalized so that the area under the curve is exactly 1.0. Positive constants do not affect proportionality. We do not actually need to compute the denominator at all in order to solve a continuous Bayesian inference problem; we just assume that the denominator is a positive constant, and so we can ignore it.

So posterior(Heads) must produce a distribution such that posterior(Heads).Weight(d) is proportional to:

prior.Weight(d) * likelihood(d).Weight(Heads)

But that is just a non-normalized weight function, and we already have the gear to produce a distribution when we are given a non-normalized weight function; we can use our Metropolis class from two episodes ago. It can take a non-normalized weight function, an initial distribution and a proposal distribution, and produce a weighted distribution from it.


Aside: Notice that we don’t even need prior to be a distribution that we can sample from; all we need is its weight function.


That was all very abstract, so let’s look at the example I proposed last time: a mint with poor quality control produces coins with a particular probability of coming up heads; that’s our prior.

Therefore we’ll need a PDF that has zero weight for all values less than 0 and greater than 1. We don’t care if it is normalized or not.

Remember, this distribution represents the quality of coins that come from the mint, and the value produced by sampling this distribution is the bias of the coin towards heads, where 0 is “double tailed” and 1 is “double headed”.

Let’s choose a beta distribution.


Aside: I’ve implemented the very useful gamma and beta distributions, but I won’t bother to show them in the blog. The code is tedious and explaining the mathematics would get us pretty off-topic. The short version is that the beta distribution is a distribution on numbers between 0 and 1 that can take on a variety of shapes. See the source code repository for the details of the implementation, and Wikipedia for an explanation of the distribution; various possible shapes for its two parameters are given below:

Probability density function for the Beta distribution

The code for the Beta and Gamma distributions and the rest of the code in this episode can be found here.


Let’s take Beta.Distribution(5, 5) as our prior distribution. That is, the “fairness” of the coins produced by our terrible mint is distributed like this:

var prior = Beta.Distribution(5, 5);
Console.WriteLine(prior.Histogram(0, 1));

                  ****                   
                 ******                 
                ********                
               **********               
               **********               
              ************              
              ************              
             **************             
             ***************            
            ****************            
            ****************            
           ******************           
          ********************          
          ********************          
         **********************         
        ***********************         
        ************************        
       **************************       
     ******************************     
----------------------------------------

Most of the coins are pretty fair, but there is still a significant fraction of coins that produce 90% heads or 90% tails.

We need a likelihood function, but that is easy enough. Let’s make an enumerated type for coins called Result with two values, Heads and Tails:

IWeightedDistribution<Result> likelihood(double d) => 
  Flip<Result>.Distribution(Heads, Tails, d);

We can use Metropolis to compute the posterior, but what should we use for the initial and proposal distributions? The prior seems like a perfectly reasonable guess as to the initial distribution, and we can use a normal distribution centered on the previous sample for the proposal.

IWeightedDistribution<double> posterior(Result r) =>
  Metropolis<double>.Distribution(
    d => prior.Weight(d) * likelihood(d).Weight(r), // weight
    prior, // initial
    d => Normal.Distribution(d, 1)); // proposal

But you know what, let’s once again make a generic helper function:

public static Func<B, IWeightedDistribution<double>> Posterior<B>(
  this IWeightedDistribution<double> prior,
  Func<double, IWeightedDistribution<B>> likelihood) =>
    b => 
      Metropolis<double>.Distribution(
        d => prior.Weight(d) * likelihood(d).Weight(b),
        prior,
        d => Normal.Distribution(d, 1));

And now we’re all set:

var posterior = prior.Posterior(likelihood);
Console.WriteLine(posterior(Heads).Histogram(0, 1));

                     ****               
                    ******              
                   *******              
                  *********             
                 ***********            
                ************            
                ************            
               *************            
               **************           
               ***************          
              ****************          
             ******************         
             ******************         
            *******************         
            ********************        
           **********************       
          ************************      
         *************************      
       *****************************    
----------------------------------------

It is subtle, but if you compare the prior to the posterior you can see that the posterior is slightly shifted to the right, which is the “more heads” end of the distribution.


Aside: In this specific case we could have computed the posterior exactly. It turns out that the posterior of a beta distribution in this scenario is another beta distribution, just with different shape parameters. But the point is that we don’t need to know any special facts about the prior in order to create a distribution that estimates the posterior.


Does this result make sense?

The prior is that a coin randomly pulled from this mint is biased, but equally likely to be biased towards heads or tails. But if we get a head, that is evidence, however slight, that the coin is not biased towards tails. We don’t know if this coin is close to fair, or very biased towards heads, or very biased towards tails, but we should expect based on this single observation that the first and second possibilities are more probable than the prior, and the third possibility is less probable than the prior. We see that reflected in the histogram.

Or, another way to think about this is: suppose we sampled from the prior many times, we flipped each coin exactly once, and we threw away the ones that came up tails, and then we did a histogram of the fairness of those that came up heads. Unsurprisingly, the coins more likely to turn up tails got thrown away more often than the coins that came up heads, so the posterior histogram is biased towards heads.


Exercise: Suppose we changed the scenario up so that we were computing the posterior probability of two flips. If we get two heads, obviously the posterior will be even more on the “heads” side. How do you think the posterior looks if we observe one head, one tail? Is it different than the prior, and if so, how? Try implementing it and see if your conjecture was correct.


What have we learned? That given a continuous prior, a discrete likelihood, and an observation, we can simulate sampling from the continuous posterior.

Why is this useful? The coin-flipping example is not particularly interesting, but of course coin-flipping is a stand-in for more interesting examples. Perhaps we have a prior on how strongly a typical member of a population believes in the dangers of smoking; if we observe one of them smoking, what can we deduce is the posterior strength of their beliefs? Or perhaps we have a prior on whether a web site user will take an action on the site, like buying a product or changing their password, or whatever; if we observe them taking that action, what posterior deductions can we make about their beliefs?

As we’ve seen a number of times in this series, it’s hard for humans to make correct deductions about posterior probabilities even when given good priors; if we can make tools available to software developers that help us make accurate deductions, that has a potentially powerful effect on everything from the design of developer tools to the design of public policy.


Next time on FAIC: Thus far in this series we’ve been very interested in the question “given a description of a distribution, how can I sample from it efficiently?” But there are other things we want to do with distributions; for example, suppose we have a random process that produces outcomes, and each outcome is associated with a gain or loss of some amount of money. What is the expected amount we will gain or lose in the long run? We’ll look at some of these “expected value” problems, and see how the gear we’ve developed so far can help compute the answers.

 

Fixing Random, part 29

[It is] a spectacular vindication of the principle that each individual coin spun individually is as likely to come down heads as tails and therefore should cause no surprise that each individual time it does.

Thus Guildenstern (or is it Rosencrantz?) in Tom Stoppard’s re-imagining of Hamlet “Rosencrantz and Guildenstern Are Dead“. If you haven’t seen it already, go watch the movie, or at least the first scene.

It helps to know the plot of Hamlet: Before the play begins, prince Hamlet’s uncle Claudius has murdered Hamlet’s father, become king, and married Hamlet’s mother Gertrude. Hamlet is understandably perturbed by this sequence of events. Claudius and Gertrude have invited Hamlet’s college friends Rosencrantz and Guildenstern to cheer him up, but SPOILER ALERT in a series of misadventures they end up dead, soon to be followed by everyone else; the plot, such as it is, of R&G Are Dead is this story told from their confused, uninformed, and amnesiac perspective.

.

.

.

Welcome back. As you now know, if you did not before, Rosencrantz and Guildenstern are flipping (or, as they say, spinning) a series of coins. In the film sometimes they are flipping a different coin each time, which Rosencrantz (or is it Guildenstern?) wins, and sometimes they’re flipping the same coin several times in a row. Either way, the coin flipper (whatever his name is) hits an enormously unlikely sequence of heads.

Let’s make up a variation on this scenario to see what we can learn about posterior distributions. (Code for this episode can be found here.)

Suppose we have two kinds of coins: fair, and double-headed. A fair coin has a 50-50 probability of coming up heads or tails; a double-headed coin always comes up heads.

enum Coin { Fair, DoubleHeaded }

Now let’s suppose we have a bag of coins. 999 are fair, and one is double-headed. Rosencrantz is going to pull a coin from the bag at random. Our prior is that the selection is random over the thousand coins in the bag, so:

var coins = new[] { Fair, DoubleHeaded };
var prior = coins.ToWeighted(999, 1);

Once he has his coin, Rosencrantz flips it, and of course, observes that it is heads.

The question I want to explore is: what is the posterior probability that Rosencrantz just flipped the double-headed coin?

That is, Rosencrantz draws a coin, flips it, it comes up heads, and now the question to you is: what would be fair odds for a bet on the question “is it the double-headed coin or a regular coin that was just flipped?” Prior to the observation the fair odds are 1 to 999, but after we observe the flip, those odds change because we have more information.

Reasoning backwards like this is a little tricky. (Of course we call it the posterior distribution because we are reasoning backwards, in case that was unclear.) Again, let’s break it down:

  • Prior to flipping the coin, the probability of getting the double-headed coin is the aptly-named prior probability: 1 in 1000, which is odds of 1 to 999.
  • If Rosencrantz flipped the coin and got tails, that would be solid evidence that he did not get the double-headed coin; the posterior probability of having the double-headed coin is zero. That is, the posterior probability is smaller: we’ve gone from 0.1% to 0%, a small decrease.
  • If Rosencrantz got heads, that is very weak evidence that he got a double-headed coin, but it is not zero evidence. There should be some small increase over the prior in the posterior.
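
If we wanted to do that math by hand, we could use odds and a likelihood ratio: the posterior odds are the prior odds multiplied by the ratio of the likelihoods of the observation. The prior odds of double-headed to fair are 1 to 999. The double-headed coin produces heads with probability 1 and the fair coin with probability 1/2, so the likelihood ratio is 2, and the posterior odds are 2 to 999; that is, the posterior probability of having the double-headed coin is 2 in 1001.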

Of course we don’t need to do the math by hand; we know from earlier in this series that we can exactly solve for posterior probabilities by using a combination of Where and SelectMany query operators:

var results = new[] { Heads, Tails };
IDiscreteDistribution<Result> likelihood(Coin c) =>
  results.ToWeighted(1, c == Fair ? 1 : 0);

var c1 = from c in prior
         from r in likelihood(c)
         where r == Heads
         select c;
Console.WriteLine(c1.ShowWeights());

        Fair:999
DoubleHeaded:2

If Rosencrantz flipped heads then the probability that he drew the double-headed coin is slightly increased over the prior. The prior probability of having the double-headed coin is 1 in 1000; the posterior probability is 2 in 1001. That’s almost twice as likely, but still pretty unlikely.

Again, let’s make sure that we understand what this means. It means that if we did millions of repetitions of this experiment where Rosencrantz picks a coin and flips it once, and if we discarded all the repetitions in which Rosencrantz flipped tails, then in the remaining repetitions on average the ratio of double-headed coins to fair coins chosen would be about 2 to 999.
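
If you’d like to check that frequency claim directly, here is a small simulation sketch; again it uses plain System.Random rather than the distribution types from this series:

using System;

var random = new Random();
const int repetitions = 10_000_000;
long doubleHeadedKept = 0;
long fairKept = 0;

for (int i = 0; i < repetitions; i += 1)
{
  bool doubleHeaded = random.Next(1000) == 0;        // 1-in-1000 chance of drawing the double-headed coin
  bool heads = doubleHeaded || random.Next(2) == 0;  // double-headed always flips heads; fair is 50-50
  if (!heads)
    continue;                                        // discard the repetitions where Rosencrantz flipped tails
  if (doubleHeaded)
    doubleHeadedKept += 1;
  else
    fairKept += 1;
}

// The kept repetitions should show roughly 2 double-headed coins per 999 fair coins.
Console.WriteLine($"double-headed: {doubleHeadedKept}, fair: {fairKept}");
Console.WriteLine($"scaled ratio: {999.0 * doubleHeadedKept / fairKept:F2} to 999");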

What if we increase the number of flips of that same coin to two, and he gets two heads?

var c2 = from c in prior
         from r1 in likelihood(c)
         where r1 == Heads
         from r2 in likelihood(c)
         where r2 == Heads
         select c;
Console.WriteLine(c2.ShowWeights());

        Fair:999
DoubleHeaded:4

The posterior probability of having the double-headed coin goes up again, this time to 4 in 1003.

What if we increase it to ten flips (again, of the same coin), and they all come up heads? By the time we get to ten heads, there are two possibilities: the exactly one-in-a-thousand chance that Rosencrantz got the double-headed coin, or the approximately one-in-a-thousand chance that he got a fair coin and flipped ten heads in a row; the ratio of those two probabilities is the posterior odds. At this point it becomes pretty much even money whether he has the double-headed coin or a fair coin. (As you might expect, the exact odds are 999 “fair” to 1024 “double-headed”, which is pretty darn close to 50-50.)

If we observe eleven or more heads, it becomes much more likely that Rosencrantz has drawn the one-in-a-thousand double-headed coin than that he is lucky enough to observe that sequence of heads from a fair coin, and of course by the time we get to 156 heads, it is absurdly more likely to be the double-headed coin than a fair coin.
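
If you want to see where those numbers come from, here is a tiny sketch of the arithmetic: after n consecutive heads the unnormalized posterior weights are 999 · (1/2)^n for a fair coin and 1 for the double-headed coin, and normalizing gives the posterior probability.

using System;

// Unnormalized posterior weights after n consecutive heads:
// fair: 999 * (1/2)^n,  double-headed: 1
foreach (int n in new[] { 1, 2, 10, 11, 156 })
{
  double fairWeight = 999.0 * Math.Pow(0.5, n);
  double doubleHeadedWeight = 1.0;
  double posterior = doubleHeadedWeight / (doubleHeadedWeight + fairWeight);
  Console.WriteLine($"{n} heads: probability of double-headed = {posterior:G4}");
}

At ten heads this gives just over 50%, matching the 999-to-1024 odds above, and by 156 heads the fair-coin weight has shrunk to roughly 10^-44, which is why the double-headed explanation wins so decisively.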

This is a cute problem and all, but it seems a little out of place at this point in our series; we stopped talking about discrete distributions that we could solve exactly for posteriors a while back. I’m discussing it now because all this is in the way of introducing a variation on the above problem:

  • We have a mint with really poor quality control; it produces coins that are out of balance, so that some of them favour heads and some of them favour tails.
  • To model this, let’s say that the mint samples from some continuous distribution that produces values strictly between 0.0 and 1.0. Let’s say that we know this distribution.
  • Each coin that is minted is modeled as a Bernoulli distribution of heads and tails, with the given probability of heads.
  • Rosencrantz flips a random coin that was obtained from that mint; this is our prior.
  • We observe that Rosencrantz gets heads.
  • What is the posterior distribution on the fairness of that coin?
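
Before we tackle that question, it may help to see the generative model written out. The sketch below uses plain System.Random rather than the distribution types from this series, and the uniform distribution standing in for the mint’s fairness distribution is just an assumption for illustration:

using System;

var random = new Random();

// Hypothetical stand-in for the mint’s continuous distribution over fairness
// (the probability of heads); the real, known prior would go here.
Func<double> mintPrior = () => random.NextDouble();

// Mint a coin: its probability of heads is a sample from the prior.
double fairness = mintPrior();

// The minted coin behaves as a Bernoulli distribution with that probability of heads.
Func<bool> flip = () => random.NextDouble() < fairness;

// Rosencrantz draws this coin and flips it once; we observe the result.
bool observedHeads = flip();
Console.WriteLine(observedHeads ? "Heads" : "Tails");

// The question: given that we observed heads, what is the posterior
// distribution of the hidden value `fairness`?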

When I first started thinking about this problem I found it quite tricky to get it clear in my head, but this continuous version is not fundamentally different from the discrete version I gave above. In both cases we have a prior distribution of coin fairness that we sample from to get a coin. In both cases we then obtain an observation of “heads”. And in both cases, we can use that observation to deduce something about the likely provenance of the coin. Whether we’re in the continuous case or the discrete case, flipping heads is evidence that the coin is more likely to be on the “heads heavy” side of the prior distribution. But how much more likely?

In the all-discrete case we could compute that posterior exactly. The question before us now is: given the continuous prior, the discrete likelihood function and an observation of a discrete value, can we compute the weight function of the continuous posterior? That is, can we answer the question “what is the probable fairness level of this coin given that we’ve observed one head?” And of course, given all that information, can we also produce a distribution object that we can sample from?

Give it some thought.


Next time on FAIC: We will attempt to do so using the tools we’ve developed thus far.