How much bias is introduced by the remainder technique?

(This is a follow-up article to my post on generating random data that conforms to a given distribution; you might want to read it first.)

Here’s an interesting question I was pondering last week. The .NET base class library has a method Random.Next(int) which gives you a pseudo-random integer greater than or equal to zero, and less than the argument. By contrast, the rand() method in the standard C library returns a random integer between 0 and RAND_MAX, which is usually 32768. A common technique for generating random numbers in a particular range is to use the remainder operator:

int value = rand() % range;

However, this almost always introduces some bias that causes the distribution to stop being uniform. Do you see why?

Let’s suppose that rand() generates a uniform probability distribution; for any call we have a 1 / 32768 chance of getting any particular number. And let’s suppose that range is 3. What is the distribution of rand() % 3? There are 10923 numbers between 0 and 32767 that produce a remainder of 0, 10923 that produce a remainder of 1 and 10922 that produce a remainder of 2, so 2 is very slightly less likely, and 0 and 1 are very slightly more likely. We’ve introduced a bias towards smaller numbers.

This difference is tiny; so small that you would not notice it without millions of data points. But now let’s suppose that we want to generate a random number between 0 and 19999 using this technique, and see what the bias is. How many ways are there to get 0? We could generate 0 or 20000, so 2. How many ways are there to get 1? We could generate 1 or 20001. And so on, up to 12767. But there is only one way to get 12768 out of rand() % 20000! Every number between 0 and 12767 is twice as likely as every number from 12768 to 19999; this is a massive bias.

How can we quantify how massive or tiny the bias is? One technique is to graph the desired probability distribution[1. Here shown as a continuous function, though of course it is a collection of discrete data points.] and the actual probability distribution, and then take the area of their difference.

bias1

Here the red line is the desired probability distribution: an even 1/20000 for each element. The blue line is the probability of getting every element with rand() % 20000, and the shaded area is the difference between the desired and actual distributions. The area under both distribution lines is 100%, as you would expect for a probability distribution. A few quick calculations shows that the shaded area is a massive 28%! By contrast if we did the same exercise for rand() % 3 the shaded area would be 0.004%.[1. Of course the two shaded areas are equal since the total area under both curves is the same. We could just compute one of the areas and double it.]

I computed the “bias area” for every range value from 1 to 32768; the result is an interesting shape:

bias2

Every power of two gives no bias at all; half way between each power of two gives an every-increasing bias area.

The moral of the story is: when using the remainder technique, you’re introducing a bias towards small numbers, and that bias can get quite large if the range is large. Even a range of a few thousand is already introducing a 5% bias towards smaller numbers.

So what then is the better technique? First off, obviously if you have a library method that does the right thing, use it! The .NET implementation of Random.Next(int) has been designed to not have this bias.[2. It does not use the technique I describe here. Rather, it generates a random double and then multiplies that by the desired range, then rounds to an integer. If the range is very large then special techniques are used to ensure good distribution.] If you must use the remainder technique then my advice is to simply discard the results that would introduce the bias. Something like:

while(true)
{
  int value = rand();
  if (value < RAND_MAX - RAND_MAX % range)
    return value % range;
}
About these ads

23 thoughts on “How much bias is introduced by the remainder technique?

  1. From the graph, the Nth best case is 2^N. The worst case between the N and N+1 best cases is 2^N + 2^(N-1), i.e. halfway between the two.

    As the bias gets greater and greater, how does your discard-bias algorithm perform on the worst cases asymptotically?

    • The algorithm doesn’t depend on the size of the bias, it depends on the proportion of values which produce bias. In the worse case (range = (MAX_RANGE / 2) + 1) it takes 2 (amortized) calls to rand() per randInt(range)

    • Brock’s analysis is correct. To expand on that.

      In the worst case fully one half of the values are discarded. The probability of going one round is therefore 1/2. The probability of going two rounds is 1/2 * 1/2 = 1/4. The probability of going three rounds is 1/2 * 1/2 * 1/2 = 1/8. In general the probability of going n rounds is 1/2n. So in practice it is very rare to go more than a small number of rounds.

      What is the expected number of rounds? That is ( 1 * 1/2 + 2 * 1/4 + 3 * 1/8 + 4 * 1/16 …) This is an arithmetic-geometric series that converges; the value it converges to is r / (1-r)2, where r in this case is 2, so the sum is 2. We expect on average to get a result in 2 rounds.

      Now, that is the “worst case” in terms of the range which causes the most items to be discarded. The truly worst case for this algorithm is that the random number generator produces numbers outside the given range forever, in which case it is not a very good random number generator! One could add a fail-safe to the loop so that if it looped more than five times, bail out and return a possibly biased value.

  2. The algorithm shown is simple, and eliminates bias, but it also throws out a lot of the data generated by the random generator. For example, if one wants numbers in the range 0 to 16384 inclusive, the algorithm will need to generate an average of almost 30 bits of random data for each number output, even though an average of only 14.000089 bits of data should be required. Are you going to examine ways of capturing left-over randomness on each stage rather than discarding it?

    • For this simple sketch I wasn’t going to consider that, no. Pseudo-random numbers are cheap, after all.

      One approach would be to compute the “bias area” and only start looping if it is greater than some undesirable amount. In your example we’d then not be throwing away half the bits.

      • Pseudo-random numbers are cheap, but some sources of real randomness aren’t. If one sometimes wants random numbers from 0 to 5, and sometimes from 0 to 999,999 (inclusive), and one’s random source is e.g. a hardware RNG that costs 1ms per 8 bits, there may be some value in capturing all the randomness one can get. Plus (at least IMHO) it’s a cute little design challenge. Roughly a dozen lines of code should suffice I think for a generator which can generate numbers for any mix of big and small ranges, provided only that the product of RAND_MAX and the largest range of interest will fit in whatever integer type one is using for two state variables.

    • No! See the video I linked to below, starting at 6:02. You just introduced a massive bias against the last number in your range.

      • Well, that’s exactly what System.Random.Next(int) does (code shown after some inlining):

        return (int)((double)this.Next() * 4.6566128752457969E-10 * (double)maxValue);

        That funny constant is 1/2^31. So… don’t use System.Random.Next(int)?

        • If it does that then yes, very bad. Although contrary to your method you’ll notice that it actually uses (RAND_MAX + 1) which avoids the obvious problem but still is not perfect because floats are not uniformly distributed.

          STL shows the empiric test results for that.

          • I think that solution is “good enough.” While not 100% perfect, one number being vastly under reported in a large range of possible values is not the end of the world. If near ideal results are required by the problem at hand, then a true cryptographic library should be employed.

    • Since floating point numbers aren’t uniformly distributed themselves this presumably isn’t perfectly uniform either. But contrary to the modulo problem, the math for that one seems more intricate.

  3. The reason why using modulo to constrain your random number to a range is bad has less to do with bias and more to do with the typical implementation of rand(), a linear congruential generator (LCG). Most language runtimes use LCGs for their random function; only very recently designed languages tend to differ.

    An LCG is just a multiply and an add (the modulo usually being implemented via an integer’s maximum size). It should be obvious that the low bits of such a sequence follow a regular pattern – the multiply doesn’t mix higher bits into the lower bits, and the add mutates the low bits in a constant way every iteration.

    And FWIW, you don’t need to convert to double to implement Random.Next(Int32) in the fractional style. If you can generate random numbers with e.g. a 32-bit unsigned range, all you need to do is a 64-bit multiply, and take the high 32 bits as your result.

    • Could the fact that upper LCG bits are “better” than lower ones be counteracted if n LCG-based rand() used a linear-congruential generator internally via something like `return ((seed >> 16) ^ seed) & 0x7FFF;` [perhaps using other shift amounts as well on processors which can handle them cheaply]? Adding such things into the “feedback loop” would risk having it fall into short cycles, but if `seed` is updated each time using an LCG, I would think that munging its output through almost any bijective function should be safe.

      • The Windows implementation of rand() already does something like this (but a bit simpler). It has 32 bits of internal state, but a call to rand() gives you 15 high bits instead of giving you low bits.

  4. The default .NET Random implementation isn’t very good (see for instance http://connect.microsoft.com/VisualStudio/feedback/details/634761/system-random-serious-bug, though I’ve run into this doing simulations).

    If you’re interested in fixing bias, it’s probably also worth pointing out that there are many freely available alternatives; mt19937 is a popular alternative; http://www.math.sci.hiroshima-u.ac.jp/~m-mat/MT/VERSIONS/C-LANG/c-lang.html lists several implementations.

  5. Pingback: Java, RNG, Raster & Maps | Kynosarges Weblog

  6. Pingback: Dew Drop – December 17, 2013 (#1685) | Morning Dew

  7. So this is further proof that the taught and accepted wisdom that, when writing a Hash Table implementation, using a prime number as your array size doesn’t work well for computers. A power of 2 will be faster (both for memory access and for the index calculations) and will give you a better distribution of indices.

    Ok, so not many people have cause to write such a thing these days, but it’s still useful to know.

  8. The double based technique avoids this particular bias, but it causes other biases. My favourite demonstration is:

    r.Next(1431655765)%2

    This should obviously return the results 0 and 1 with (nearly) equal probability. What it actually does is returning 0 in with probability 2/3 and 1 with probability 1/3.

    See http://stackoverflow.com/a/6842191/445517 a full example and the rest of my rant about System.Random ;)

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s