Last time on FAIC we made a correct, efficient implementation of `SelectMany`

to bind a likelihood function and projection onto a prior, and gave a simple example. I deliberately chose “weird” numbers for all the weights; let’s do that same example again but with more “nice round number” weights:

var prior = new List<Height>() { Tall, Medium, Short }

.ToWeighted(60, 30, 10);

[…]

IDiscreteDistribution<Severity> likelihood(Height h)

{

switch(h)

{

case Tall: return severity.ToWeighted(45, 55, 0);

case Medium: return severity.ToWeighted(0, 70, 30);

default: return severity.ToWeighted(0, 0, 1);

}

}

[… projection as before…]

Console.WriteLine(prior

.SelectMany(likelihood, projection)

.ShowWeights());

This produces the output:

DoubleDose:270000 NormalDose:540000 HalfDose:190000

which is correct, but you notice how multiplying the weights during the `SelectMany`

made for some unnecessarily large weights. If we then did another `SelectMany`

on this thing, they’d get even larger, and we’d be getting into integer overflow territory.

Integer overflow is always possible in the system I’ve developed so far in this series, and I am deliberately glossing over this serious problem. A better implementation would either use doubles for weights, which have a much larger range, or arbitrary-precision integers, or arbitrary-precision rationals. I’m using integers to keep it simple, but as with many aspects of the code in this series, that would become problematic in a realistic implementation.

One thing we can do to tame this slightly is to reduce all the weights when possible; plainly in this case we could divide each of them by 10000 and have exactly the same distribution, so let’s do that. And just to make sure, I’m going to mitigate the problem in multiple places:

- In
`SelectMany`

we could be taking the least common multiple (LCM) instead of the full product of the weights. - In the
`WeightedInteger`

factory we could be dividing out all the weights by their greatest common divisor (GCD).

Long-time readers of my blog may recall that I’ve implemented Euclid’s Algorithm before, but this time I’m going to make a much simpler implementation:

public static int GCD(int a, int b) =>

b == 0 ? a : GCD(b, a % b);

We define the GCD of two non-negative integers a and b as:

- if both zero, then zero
- otherwise, if exactly one is zero, then the non-zero one
- otherwise, the largest integer that divides both.

**Exercise:** Prove that this recursive implementation meets the above contract.

The problem we face though is that we have many weights and we wish to find the GCD of all of them. Fortunately, we can simply do an aggregation:

public static int GCD(this IEnumerable<int> numbers) =>

numbers.Aggregate(GCD);

Similarly we can compute the LCM if we know the GCD:

public static int LCM(int a, int b) =>

a * b / GCD(a, b);

public static int LCM(this IEnumerable<int> numbers) =>

numbers.Aggregate(1, LCM);

And now we can modify our `WeightedInteger`

factory:

public static IDiscreteDistribution<int> Distribution(

IEnumerable<int> weights)

{

List<int> w = weights.ToList();

int gcd = weights.GCD();

for (int i = 0; i < w.Count; i += 1)

w[i] /= gcd;

[…]

And our `SelectMany`

:

int lcm = prior.Support()

.Select(a => likelihood(a).TotalWeight())

.LCM();

[… and then use the lcm in the query …]

See the code repository for all the details. If we apply all these changes then our results look much better…

DoubleDose:27 NormalDose:54 HalfDose:19

… and we are at least a little less likely to get into an integer overflow situation.

**Aside:** Of course we can do the same thing to the `Bernoulli`

class, and normalize its weights as well.

**Next time on FAIC:** We can use the gear we’ve created so far to solve problems in *Bayesian inference*; we’ll see how.

oh no, spoilers! 8)

Did you accidentally press the publish button too early?

Yes, I accidentally pressed “Publish” on February 4th, when I wrote this part. Oops.

Pingback: So long, MSDN blog | Fabulous adventures in coding

You might want to add a note to the next episode to alert people using RSS readers, who might miss out on this episode because it was published (and unpublished) a few weeks ago.

If integer overflow is a concern, wouldn’t it be cleaner to do “a / GCD(a, b) * b” for the LCM calculation?

Pingback: Dew Drop – March 25, 2019 (#2925) | Morning Dew