We’re going to spend the next while in this series on expressing probabilities that have some sort of extra condition associated with them. We’ll start with a simple example: I want to roll a fair six-sided die, but discard the threes. Why? Maybe I really dislike threes. Doesn’t matter; we will want to be able to create new distributions by putting conditions on other distributions, so let’s figure out how to do so.
Now, of course, we could create the desired distribution easily enough; as we know from last time on FAIC, it is
WeightedInteger.Distribution(0, 1, 1, 0, 1, 1, 1)
But suppose we have a distribution already in hand which we wish to apply a condition to, post hoc. I’m not going to labour the point; we can see how to do that easily enough. It’s a variation on our “projected” distribution from a while back:
public sealed class Conditioned<T> :
IDiscreteDistribution<T>
{
private readonly List<T> support;
private readonly IDiscreteDistribution<T> underlying;
private readonly Func<T, bool> predicate;
public static IDiscreteDistribution<T> Distribution(
IDiscreteDistribution<T> underlying,
Func<T, bool> predicate)
{
var d = new Conditioned<T>(underlying, predicate);
if (d.support.Count == 0)
throw new ArgumentException();
if (d.support.Count == 1)
return Singleton<T>.Distribution(d.support[0]);
return d;
}
private Conditioned(
IDiscreteDistribution<T> underlying,
Func<T, bool> predicate)
{
this.underlying = underlying;
this.predicate = predicate;
this.support = underlying.Support()
.Where(predicate)
.ToList();
}
public T Sample()
{
while (true) // Ouch
{
T t = this.underlying.Sample();
if (this.predicate(t))
return t;
}
}
public IEnumerable<T> Support() =>
this.support.Select(x => x);
public int Weight(T t) =>
predicate(t) ? underlying.Weight(t) : 0;
}
Given our discussion a few episodes back about projected distributions and the fact that creating them has the same signature as “Select”, I’m sure it will come as no surprise to you that I’m going to make a helper method:
public static IDiscreteDistribution<T> Where<T>(
this IDiscreteDistribution<T> d,
Func<T, bool> predicate) =>
Conditioned<T>.Distribution(d, predicate);
And you know what this means: I get to use comprehension syntax again!
var noThrees = from roll in SDU.Distribution(1, 6)
where roll != 3
select roll;
Console.WriteLine(noThrees.Histogram());
And we get as expected, no threes:
1|*************************************** 2|*************************************** 4|*************************************** 5|*************************************** 6|***************************************
However, there are some problems here. That possibly-long-running loop is deeply worrisome. In fact, dealing with the existence of that loop will be the major theme of the rest of this series, in one way or another. (This should feel familiar; of course this is just another version of the “rejection sampling” problem from a few episodes ago.)
We’ll talk about that loop problem more in the next episode. For the remainder of this episode I want to examine an assumption I made in the code above; it’s the same assumption that I made when we discussed projections. We just blew right past it, but this assumption introduces what might be characterized as a serious bug.
Consider the following code, which uses only sequences, no distributions:
int filterOut = 3;
Func<int, bool> predicate = x => x != filterOut;
var range = Enumerable.Range(1, 6).Where(predicate);
Console.WriteLine(range.CommaSeparated());
filterOut = 4;
Console.WriteLine(range.CommaSeparated());
Aside: I got sick of typing string.Join(… blah blah blah…) so I made a handful of extension methods to be a little more fluent. See the source code for details.)
If you recall that sequences are computed lazily and lambdas are closed over variables, not values, then this output should be expected:
1,2,4,5,6 1,2,3,5,6
What if we do the same thing to distributions?
int filterOut = 3;
Func<int, bool> predicate = x => x != filterOut;
var d = SDU.Distribution(1, 6).Where(predicate);
Console.WriteLine(d.Samples().Take(10).CommaSeparated());
filterOut = 4;
Console.WriteLine(d.Samples().Take(10).CommaSeparated());
As we’d expect, we first get no threes, and then no fours:
1,1,4,6,6,5,4,1,2,6 6,5,6,5,1,5,6,3,2,1
So what’s the problem?
Console.WriteLine(d.Support().CommaSeparated());
1,2,4,5,6
Uh oh. We just produced a 3 in a distribution that does not list 3 in its support!
Why? Because I computed the support in the constructor and cached it, but the support changed when the predicate changed its behaviour, and it is now out-of-date. The object is now lying to us.
Obviously I could fix this problem easily enough: do not compute the support in the constructor; compute it dynamically, on demand. Is that the right thing to do?
If we do, we lose the ability to reject “null” distributions — distributions with no support — and therefore it is possible to get into a situation where sampling hangs because the predicate is never true. (We could re-check the support before every sample, and throw if the support is empty, but that seems expensive.)
Furthermore, as we’ll see in the next few episodes, we can do some pretty good optimizations if we can assume that the predicate does not change.
I’m therefore going to state right now that the predicate passed to Where (and the projection passed to Select) must be “pure” functions when they are intended to operate on probability distributions.
A “pure” function for our purposes has the following characteristics:
- It must complete normally; in its regular operation it should not hang and it should not throw. It is permissible for a pure function to throw if its preconditions are violated; for example, a pure function that is documented as not taking negative numbers is permitted to throw an argument exception if passed a negative number. But a pure function should not throw or hang when given a normal, expected input.
- It must not produce any side effect. For example, it must not write to the console or mutate a global variable or anything like that.
- Its outputs must depend solely on its inputs; a pure method does not produce one result on its first call and a different result on its second call because something in the world changed; the only reason to produce a different result is if the arguments passed in are different.
Pure functions are nice to work with in two particular ways. First, you can reason about their correctness entirely “locally”; you do not have to consider the state of the world at the time the call is made, and you do not have to worry that the state of the world will change depending on how many times the call happens. Second, you can get performance wins by taking advantage of the fact that you know that two calls with the same arguments will produce the same result.
Unfortunately, in C# we have no good way of detecting a non-pure method at compile time and outlawing them as arguments to Where and Select; we will have to rely on the user being smart enough to not shoot themselves in the foot here.
Next time on FAIC: Now that we are requiring purity in Where and Select, what optimizations can we make to the discrete distributions we’ve created so far?
I keep meaning to ask you whether you know of Hansei, Oleg Kiselyov’s work on a probabilistic programming DSL (http://okmij.org/ftp/kakuritu/)? I sort’ve assumed you had, given how much work you’ve been doing both in probabilistic programming and with (delimited? I hope!) continuations.
Thanks for the link; I have looked at a number of PPLs while researching this series, but I had not yet looked at that one.
The ideas I’m going to cover in this series are some of the simpler scenarios covered by a proper, industrial-strength PPL. I’m considering whether I want to get this series all the way to coroutines implemented as multi-shot continuations; there’s a lot of background there to cover.
Is it possible in C# to detect non-pure functions by taking an Expression instead of the Func predicate as input?
Same question here.
I don’t see how Expression would help – it’s an abstract syntax tree, after all. Sooner or later you have to compile it, and then that little LaunchTheMissiles() method is called.
In Haskell they can very clearly segregate pure from impure code because the type system forbids it – if a method returns something in the IO monad, it’s impure. If it’s not in the IO monad, it’s pure because you can’t even write impure stuff outside IO.
Well, first off, lambdas converted to expression trees cannot contain assignments or increments, so that helps a little bit. But expression tree lambdas can be closed over mutable variables and they can read global state. So it’s not much help. I suppose you could write a visitor that looks for closures and accessing global state and throw. But expression tree lambdas can also call impure functions. If you had a detector for impure functions you could use that, but now the question is begged; we are trying to create a detector for impure functions, so presupposing that we have one doesn’t help!
Pingback: Fixing Random, part 11 | Fabulous adventures in coding