# Fixing Random, part 17

Before we get going on today’s episode of FAIC, you might want to refresh your memory of what an additive monad is; I wrote an episode of my monad series on this subject. Briefly, an additive monad is a monad where there is a “zero value”; like the number zero, “multiplying” by zero produces a zero, and “adding” a zero is an identity.

For example, the sequence monad, `IEnumerable<T>`, has a zero: the empty sequence. If we `Select` or `SelectMany` from the empty sequence — our analog of “multiplying” — we get an empty sequence.  If we concatenate an empty sequence onto another sequence — the sequence analog of “adding” — we get the original sequence.

All additive monads can have a `Where` function defined on them; if we wanted to implement `Where` for sequences and didn’t care about performance, we could implement it like this:

public static IEnumerable<T> Single<T>(T t)
{
yield return t;
}
public static IEnumerable<T> Zero<T>()
{
yield break;
}
// Non-standard Where:
public static IEnumerable<T> Where<T>(
this IEnumerable<T> items,
Func<T, bool> predicate) =>
from a in items
from b in predicate(a) ? Single(a) : Zero<T>()
select b;

That’s slow and produces hideous collection pressure, but it works; our actual implementation of `Where` is just an optimization.

What about the converse? Our probability monad `IDiscreteDistribution<T>` has a `Where` function defined. We definitely have a `Singleton<T>` type. But our implementation of the distribution monad does not appear to have a zero value. It seems plausible that there should be a way to express `Where` on distributions as we did with the sequence monad: as a `SelectMany` that produces either the single or zero distributions based on the predicate.

Give that some thought, and then scroll down when you think you have it worked out.

.

.

.

.

.

.

Just as the zero of the sequence monad is the empty sequence, the zero of the distribution monad is the empty distribution. That is, the distribution with an empty support that throws every time it is sampled.

We never implemented this value because every distribution class we’ve created already throws when you try to create an empty distribution:

• `StandardDiscreteInteger` throws if the range is empty.
• `Bernoulli` and `WeightedInteger` both throw if you give them all zero weights.
• In our current implementation a `Where` clause where the predicate is false for everything in the support of the underlying collection will eventually throw.
• In our original implementation, a `Where` clause where the predicate is always false hangs when sampled, but does not throw.
• Our implementation of `Select` throws if the support is empty.
• And so on.

Exercise: We have learned the following facts:

• The zero value of the discrete distribution monad is the empty distribution.
• The joint distribution produced by `SelectMany` is the analog of multiplication of two distributions.
• Concatenation is the “addition” of the sequence monad. (The two sequences have to be of the same element type.)

I think it is pretty clear that doing a `SelectMany` on an empty distribution has to produce an empty distribution. But we still have a mystery to solve: what is the addition operator on two discrete distributions? They have to be of the same element type. The addition operator has to have the property that adding zero to any distribution is an identity, but what does it mean to add together two non-zero distributions?

It turns out that there are some uses for an explicit empty distribution; we’ll discover what the specific benefits of it are in a later episode.

What are the costs? I don’t mean implementation costs, but rather, what are the down sides to developers of having this feature? In short: if we go down this road, what new opportunities for bugs are we producing?

One interesting cost is that we will defer an operation that can throw; this can be very confusing! A classic source of StackOverflow questions is when someone writes an enumerator block:

static IEnumerable<int> Foo(string bar)
{
if (bar == null)
throw new ArgumentNullException();
yield return bar.Length;
…
}

and then calls it:

var foo = Foo(accidentallyNullThing); // no throw
[…]
foreach (int x in foo) // throw!

The source of the problem is that the throw is delayed. If you look at the proper, industrial-strength implementations of `Where`, `Select` and so on, you’ll notice that each one is written in a style where it validates its arguments first, and then returns a call to a helper method that actually does the iteration. That way the exception is thrown close to the point of the mistake.

However, that doesn’t fix other common variations on the problem. For example, you might have some buggy code that produces an empty sequence sometimes, and then a thousand lines later you call `First` on the sequence and it blows up, but the bug is where the sequence is produced.

And of course this is really no different than nullable types that blow up when we forget that they can be null; a nullable `T` is logically a sequence of `T` where the sequence length is either zero or one, and if we forget that it can be “zero length”, we get into trouble.

The empty distribution will have the same property: it will be easy to create it by accident in a buggy program and it will not blow up until it is sampled, just as nullable reference types do not blow up until they are dereferenced.

That said, we’re going to do it because the benefits are actually pretty compelling, oddly enough.

Next time on FAIC: In the next regularly-scheduled episode we will implement the empty distribution; it’ll be quite straightforward, but it will necessitate fixing up some of our existing code. However, before then I’m going to interrupt this series with a very special episode that addresses a long-standing puzzler in probability theory which I just realized we now have all the gear we need to answer. Stay tuned!

## 17 thoughts on “Fixing Random, part 17”

1. My immediate intuition for adding two non-empty discrete distributions is to produce a new distribution with the combined support of the two being added. So adding distributions with supports { 1, 1, 2, 2, 2, 3 } and { 2, 3, 3, 4, 5, 5 } would give a distribution with support { 1, 1, 2, 2, 2, 2, 3, 3 ,3, 4, 5, 5 }.

I haven’t thought through all the implications of this, but it at least seems reasonable.

• That was my thought too. It also gives you correct results for the Where implementation from today’s post. On the other hand, the result of adding a distribution to itself is odd. You get back an equivalent distribution.

• Yeah, that’s interesting. I’m also having trouble figuring out an intuitive meaning for adding distributions together.

If I add the distributions for rolling a 4-sided die and a 6-sided die to get { 1, 1, 2, 2, 3, 3, 4, 4, 5, 6 }, what does sampling a value, and the associated probabilities, mean in terms of operations with those dice?

• I d say youd have effectively {1W6, 2W6, 3W6, 4W6, 5W6, 6W6, 1W4, 2W4, 3W4, 4W4} as sample and you’d need to provide a projection function to get from each added distribution to the final output space. That would also solve the duplicate issue. This projection can default to identity in case type of distribution one and two are equal.

• The “intuitive meaning” is that sampling from this distribution is equivalent to “flip a coin: heads, roll 1D6, tails, roll 1D4”.

• And that explains why adding a distribution to itself is an identity: “heads, roll 1D6, tails, roll 1D6” is the same as “roll 1D6”.

• An oddity of this interpretation is that addition of discrete distributions is not associative. That is, (A+B)+C is not necessarily the same distribution as A+(B+C). The addition operator on sequences is associative, since concatenation is associative.

• The problem with that is that the outcomes are not evenly weighted. If you just do a union of the supports then each item has an equal weight, in our case a 1-in-10 probability. If you flip a coin to choose the die to roll, then the D6 faces each have a probability of 1/12, and the D4 faces, 1/8. This could be fixed by using a weighted coin.

• Looking at associativity, I agree that the flip-a-coin-and-sample interpretation is not associative. However, if adding distributions is just combining the supports into a singe, larger support, then I think associativity holds as long as you don’t normalize the resulting support.

• It seems like drewjcooper’s “never normalize” method assigns each IDiscreteDistribution an implicit “scale factor” equal to the sum of its weights.

To get to the way you want (flip a coin, then use that distribution), you can assign an explicit scale factor. All IDiscreteDistributions start with a scale of 1 except for the null distribution, which has scale 0. The scale of the sum of two distributions is the sum of the two scales. Then associativity should work. (Also note that you can store scale as a nonnegative integer.)

I’m not sure how this actually maps back into mathematics. I guess it can be some kind of “improper prior” over the distributions.

Without the (implicit or explicit) scale, adding distributions gets a little bit weird. You get a + a = a, which seems a bit counterintuitive.

2. My immediate intuition was that if “multiplication” is a dependent distribution then maybe “addition” is an independent distribution? Meaning, doing a fair coin flip { H, T }, added to another, gives you { (H, H), (H, T), (T, H), (T, T) }

3. I’ve often wondered, could C# have had “yield continue” (alongside yield-break) where the start of the functions runs immediately up to the yield-continue, then hands control back to the caller until the loop starts.

While I’m making wishes, I’d also like “yield foreach X” which would expand to:
foreach(var item in X) yield return item;

You still work on the C# team at Microsoft, right?

• Your first suggestion is basically how “await” works. In an async method, invoking the method causes it to run until the first await on a non-completed task. So yes, C# 2 could have been designed so that the code runs up until the first yield, and then hands back an iterator in unstarted state.

There was a research version of C# called C-Omega that had the “yield foreach” feature, and F# also has this feature in their “yield!” operator, which is the most exciting operator. There are a few technical problems in making it efficient; there are some bad performance cases for your naive “just turn it into a foreach” proposal. It’s come up many times in the C# design meeting and never been considered an important enough feature to make it into the language, but there is not an in-principle reason why it cannot be done.

I left Microsoft in November of 2012; since then I’ve worked at Coverity and Facebook.