Fixing Random, bonus episode 2: pigeons and the noisy-or distribution

Source code for this episode is here.


Welcome to this special bonus episode of Fixing Random, the immensely long blog series where I discuss ways to add probabilistic programming features into C#. I ran into an interesting problem at work that pertains to the techniques that we discussed in this series, so I thought I might discuss it a bit today.

Let’s suppose we have three forts, Fort Alpha, Fort Bravo and Fort Charlie at the base of a mountain. They are constantly passing messages back and forth by carrier pigeon. Alpha and Charlie are too far apart to fly a pigeon directly, so messages from Alpha to Charlie first go from Alpha to Bravo, and then on from Bravo to Charlie. (Similarly. messages from Charlie to Alpha go via Bravo, but we’ll not worry about that direction for the purposes of this discussion.)

Carrier pigeons are of course an unreliable mechanism for passing messages, so let’s model this as a Bernoulli process; every time we send a message from Alpha to Bravo, or Bravo to Charlie, we flip an unfair coin. Heads, the bird gets through, tails it gets lost along the way.

From this we can predict the reliability of passing a message from Alpha to Charlie via Bravo; the probability of failure is the probability that A-B fails or B-C fails (or both). Equivalently, the probability of success is the probability of A-B succeeding and B-C succeeding. This is just straightforward, basic probability; if the probability of success from A-B is, say, 95% and B-C is 96%, then the probability of success from A to C is their product, around 91%.


Aside: Note that I am assuming in this article that pigeons are passing from Bravo to Charlie even if a pigeon failed to arrive from Alpha; I’m not trying to model in this system constraints like “pigeons only fly from Bravo to Charlie when one arrived from Alpha”.


Now let’s add an extra bit of business.

We have an observer on the mountaintop at Fort Delta overlooking Alpha, Bravo and Charlie. Delta has some high-power binoculars and is recording carrier pigeon traffic from Alpha to Bravo and Bravo to Charlie. But here’s the thing: Delta is an unreliable observer, because observing carrier pigeons from a mountaintop is inherently error-prone; sometimes Delta will fail to see a pigeon. Let’s say that 98% of the time, Delta observes a pigeon that is there, and Delta never observes a pigeon that is not there.

Every so often, Delta issues a report: either “the channel from Alpha to Charlie is healthy” if Delta has observed a pigeon making it from Alpha to Bravo and also a pigeon making it from Bravo to Charlie. But if Delta has just failed to observe either a pigeon going from Alpha to Bravo, or a pigeon going from Bravo to Charlie, then Delta issues a report saying “the channel from Alpha to Charlie is unhealthy“.

The question now is: suppose Delta issues a report that the Alpha-Charlie channel is unhealthy. What is the probability that a pigeon failed to get from Alpha to Bravo, and what is the probability that a pigeon failed to get from Bravo to Charlie? Each is surely much higher than the 5-ish percent chance that is our prior.

We can use the gear we developed in the early part of my Fixing Random series to answer this question definitively, but before we do, make a prediction. If you recall episode 16, you’ll remember that you can have a 99% accurate test but the posterior probability of having the disease that the test diagnoses is only 50% when you test positive; this is a variation on that scenario.

Rather than defining multiple enumerated types as I did in earlier episodes, or even using bools, let’s just come up with a straightforward numeric encoding. We’ll say that 1 represents “a pigeon failed to make the journey”, and 0 means “a pigeon successfully made the journey” — if that seems backwards to you, I agree but it will make sense in a minute.

Similarly, we’ll say that 1 represents “Delta’s attempt to observe a pigeon has failed”, and 0 as success, and finally, that 1 represents Delta making the report “the channel is unhealthy” and 0 represents “the channel is healthy”.

The reason I’m using 1 in all these cases to mean “something failed” is because I want to use OR to get the final result. Let’s build our model:

var ab = Bernoulli.Distribution(95, 5);
var bc = Bernoulli.Distribution(96, 4);
var d = Bernoulli.Distribution(98, 2);
  • 5% of the time ab reports 1: the pigeon failed to get through.
  • 4% of the time, bcreports 1: the pigeon failed to get through.
  • 2% of the time, d reports 1: it fails to see a pigeon that is there.

Now we can ask and answer our question about the posterior: what do we know about the posterior distribution of pigeons making it from Alpha to Bravo and Bravo to Charlie? We’ll sample from ab and bc once to find out if a pigeon failed to get through, and then ask whether Delta failed to observe the pigeons.

What is the condition that causes Delta to report that the channel is unhealthy?

  • the pigeon from Alpha to Bravo failed, OR
  • the pigeon from Bravo to Charlie failed, OR
  • Delta failed to observe Alpha’s pigeon, OR
  • Delta failed to observe Bravo’s pigeon.

We observe that Delta reports that the channel is unhealthy, so we’ll add a where clause to condition the result, and then print out the resulting posterior joint probability:

var result = from pab in ab
             from pbc in bc
             from oab in d
             from obc in d
             let report = pab | pbc | oab | obc
             where report == 1
             select (pab, pbc);
Console.WriteLine(result.ShowWeights());

This is one of several possible variations on the “noisy OR distribution” — that is, the distribution that we get when we OR together a bunch of random Boolean quantities, but where the OR operation itself has some probabilistic “noise” attached to it.


Aside: That’s why I wanted to express this in terms of OR-ing together quantities; of course we can always turn this into a “noisy AND” by doing the appropriate arithmetic, but typically this distribution is called “noisy OR”.


We get these results:

(0, 0):11286    -- about 29%
(0, 1):11875    -- about 30%
(1, 0):15000    -- about 39%
(1, 1):625      -- less than 2%

Remember that (0, 0) means that pigeons did make it from Alpha to Bravo and from Bravo to Charlie; in every other case we had at least one failure.

It should not be too surprising that (1, 1) — both the Alpha and Bravo pigeons failed simultaneously — is the rarest case because after all, that happens less than 9% of the cases overall, so it certainly should happen in some smaller percentage of the “unhealthy report” cases.

But the possibly surprising result is: when Delta reports a failure, there is a 29% chance that this is a false positive report of failure, and in fact it is almost as likely to be a false positive as it is to be a dropped packet, I mean lost pigeon, between Bravo and Charlie!

Put another way, if you get an “unhealthy” report, 3 times out of 10, the report is wrong and you’ll be chasing a wild goose. Just as we saw with false positives for tests for diseases, if the test failure rate is close to the disease rate of the population, then false positives make up a huge percentage of all positives.

My slip-up there of course illuminates what you figured out long ago; all of this whimsy about forts and pigeons and mountains is just a silly analogy. Of course what we really have is not three forts connected by two carrier pigeon routes, but ten thousand machines and hundreds of routers in a data center connected by network cabling in a known topology. Instead of pigeons we have trillions of packets. Instead of an observer in a fort on a mountaintop, we have special supervising software or hardware that is trying to detect failures in the network so that they can be diagnosed and fixed. Since the failure detection system is itself part of the network, it also is unreliable, which introduces “noise” into the system.

The real question at hand is: given prior probabilities on the reliability of each part of the system including the reporting system itself, what are the most likely posterior probabilities that explain a report that some part of the network is unhealthy?

This is a highly practical and interesting question to answer because it means that network engineers can quickly narrow down the list of possible faulty components given a failure report to the most likely culprits.  The power of probabilistic extensions in programming languages is that we now have the tools that we need to concisely express both those models and the observations that we need explanations for, and then generate the answers automatically.

Of course I have just given some of the flavor of this problem space and I’m sure you can imagine a dozen ways to make the problem more interesting:

  • what if instead of a coin flip that represent “dropped” or “delivered” on a packet, we had a more complex distribution on every edge of the network topology graph — say, representing average throughput?
  • How does moving to a system with continuous probabilities change our analysis and the cost of producing that analysis?
  • And so on.

We can use the more complex tools I developed in my series, like the Metropolis method, to solve these harder problems; however, I think I’ll leave it at that for now.

Advertisements