Fixing Random, part 15

[Code is here.]

Last time on FAIC we made a correct, efficient implementation of SelectMany to bind a likelihood function and projection onto a prior, and gave a simple example. I deliberately chose “weird” numbers for all the weights; let’s do that same example again but with more “nice round number” weights:

This produces the output:

DoubleDose:270000
NormalDose:540000
HalfDose:190000

which is correct, but you notice how multiplying the weights during the SelectMany made for some unnecessarily large weights. If we then did another SelectMany on this thing, they’d get even larger, and we’d be getting into integer overflow territory.

Integer overflow is always possible in the system I’ve developed so far in this series, and I am deliberately glossing over this serious problem. A better implementation would either use doubles for weights, which have a much larger range, or arbitrary-precision integers, or arbitrary-precision rationals. I’m using integers to keep it simple, but as with many aspects of the code in this series, that would become problematic in a realistic implementation.

One thing we can do to tame this slightly is to reduce all the weights when possible; plainly in this case we could divide each of them by 10000 and have exactly the same distribution, so let’s do that. And just to make sure, I’m going to mitigate the problem in multiple places:

In SelectMany we could be taking the least common multiple (LCM) instead of the full product of the weights.
In the WeightedInteger factory we could be dividing out all the weights by their greatest common divisor (GCD).

Long-time readers of my blog may recall that I’ve implemented Euclid’s Algorithm before, but this time I’m going to make a much simpler implementation:

public static int GCD(int a, int b) =>
b == 0 ? a : GCD(b, a % b);

We define the GCD of two non-negative integers a and b as:

if both zero, then zero
otherwise, if exactly one is zero, then the non-zero one
otherwise, the largest integer that divides both.

Exercise: Prove that this recursive implementation meets the above contract.

The problem we face though is that we have many weights and we wish to find the GCD of all of them. Fortunately, we can simply do an aggregation:

public static int GCD(this IEnumerable<int> numbers) =>
numbers.Aggregate(GCD);

Similarly we can compute the LCM if we know the GCD:

public static int LCM(int a, int b) =>
a * b / GCD(a, b);
public static int LCM(this IEnumerable<int> numbers) =>
numbers.Aggregate(1, LCM);

And now we can modify our WeightedInteger factory:

public static IDiscreteDistribution<int> Distribution(
IEnumerable<int> weights)
{
List<int> w = weights.ToList();
int gcd = weights.GCD();
for (int i = 0; i < w.Count; i += 1)
w[i] /= gcd;
[…]

And our SelectMany:

int lcm = prior.Support()
.Select(a => likelihood(a).TotalWeight())
.LCM();
[… and then use the lcm in the query …]

See the code repository for all the details. If we apply all these changes then our results look much better…

DoubleDose:27
NormalDose:54
HalfDose:19

… and we are at least a little less likely to get into an integer overflow situation.

Aside: Of course we can do the same thing to the Bernoulli class, and normalize its weights as well.

Next time on FAIC: We can use the gear we’ve created so far to solve problems in Bayesian inference; we’ll see how.

7 thoughts on “Fixing Random, part 15”

Dave on February 4, 2019 at 3:44 pm said:

oh no, spoilers! 8)

Reply ↓
sollyucko on February 4, 2019 at 5:27 pm said:

Did you accidentally press the publish button too early?

Reply ↓
ericlippert on March 22, 2019 at 10:02 am said:

Yes, I accidentally pressed “Publish” on February 4th, when I wrote this part. Oops.

Reply ↓
Pingback: So long, MSDN blog | Fabulous adventures in coding
joeym on March 24, 2019 at 1:27 pm said:

You might want to add a note to the next episode to alert people using RSS readers, who might miss out on this episode because it was published (and unpublished) a few weeks ago.

Reply ↓
Scepheo on March 25, 2019 at 4:14 am said:

If integer overflow is a concern, wouldn’t it be cleaner to do “a / GCD(a, b) * b” for the LCM calculation?

Reply ↓
Pingback: Dew Drop – March 25, 2019 (#2925) | Morning Dew