Fixing Random, part 4

[Code is here.]


Last time on FAIC I showed how we could use a simple interface to cleanly implement concepts like “normal distribution” or “standard uniform distribution” in a natural style using fluent programming. For the next many episodes we’ll leave continuous distributions aside and consider only discrete distributions. That is, we have some probability distribution where there are only a small, fixed number of outcomes. Flip a coin, roll three dice and add them together, that sort of thing.

Since we assume the number of possible outcomes of a discrete distribution is small, we can enumerate them; the set of possible outcomes is, oddly enough, called the support of the distribution.

Some elements might be more likely than others; the ones that happen more often we say have more weight than the ones that happen less often. I want to be able to describe the weight of each member of a distribution easily, and I do not want to get off in the weeds of representing weights as doubles because double arithmetic has many pitfalls as we well know. For the purposes of this series I’m going to assume that we can weight the members of a distribution by an integer. (If I get around to it I might discuss double-weighted distributions later on.)

public interface IDiscreteDistribution<T> : IDistribution<T>
{
  IEnumerable<T> Support();
  int Weight(T t);
}

I’m going to require that the weight function accept any value of type T. If the value is not in the support, the weight is zero, and if the value is in the support, the weight is a positive integer. I’m going to further assume that the support and weights are small enough that the total weight fits into an integer.


Aside: If you divide the weight function by the total weight (as a double) then you have the probability mass function of the discrete distribution.
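
For instance, a tiny helper along those lines might look like this; it is only a sketch (ProbabilityMass is my hypothetical name for it, and it assumes LINQ's Sum is available), not part of the series' code:

public static double ProbabilityMass<T>(
  this IDiscreteDistribution<T> d, T t) =>
  // Weight of t divided by the total weight of the support.
  (double)d.Weight(t) / d.Support().Sum(s => d.Weight(s));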


I said way back in part one of this series that System.Random has a “candy-machine interface” problem. Let’s solve that problem right now:

using SCU = StandardContinuousUniform;
public sealed class StandardDiscreteUniform :
  IDiscreteDistribution<int>
{
  public static StandardDiscreteUniform Distribution(
    int min, int max)
  {
    if (min > max)
      throw new ArgumentException();
    return new StandardDiscreteUniform(min, max);
  }
  public int Min { get; }
  public int Max { get; }
  private StandardDiscreteUniform(int min, int max)
  {
    this.Min = min;
    this.Max = max;
  }
  public IEnumerable<int> Support() =>
    Enumerable.Range(Min, 1 + Max - Min);
  public int Sample() =>
    (int)((SCU.Distribution.Sample() *
      (1.0 + Max - Min)) + Min);
  public int Weight(int i) =>
    (Min <= i && i <= Max) ? 1 : 0;
  public override string ToString() =>
    $"StandardDiscreteUniform[{Min}, {Max}]";
}

Finally we can now do things like 10d6 (D&D notation for “sum ten six-sided die rolls”) in a natural, fluent style. I would far prefer:

SDU.Distribution(1, 6).Samples().Take(10).Sum()

to

random.Next(1, 7) + random.Next(1, 7) + random.Next(1, 7) + …

or the equivalent loop nonsense that you typically see.
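
Just to make that concrete, here is roughly what a quick console check might look like; this is only a sketch, assuming SDU is an alias for StandardDiscreteUniform and using the Samples() extension method from the earlier episodes:

using SDU = StandardDiscreteUniform;
// Roll 10d6 five times; each Sum() is one roll of ten six-sided dice.
var d6 = SDU.Distribution(1, 6);
for (int i = 0; i < 5; i += 1)
  Console.WriteLine(d6.Samples().Take(10).Sum());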

A common defense of “lower bound inclusive, upper bound exclusive” is that it is natural for programmers in C-family languages to think of an array or list as indexed from 0 (inclusive) to the array length (exclusive). Apparently we want to be able to say things like

myArray[random.Next(0, myArray.Length)]

and not get confused and have to say myArray.Length-1.  Well, my answer to that (germane) critique is: once again, we’re reasoning at the wrong level. If we want a probability distribution drawn from elements of an array, then don’t mess around with array indexing and choosing the right random number and so on. That is writing code at the level of the mechanisms, not the business domain. Instead, make an object that represents a distribution of elements chosen from a given array and sample from it!
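
Just to give a flavor of what I mean (this is only a hypothetical sketch with a made-up name; the real thing is the subject of a later episode), such a distribution could be as simple as:

public sealed class ElementUniform<T> : IDiscreteDistribution<T>
{
  private readonly T[] items;
  private readonly StandardDiscreteUniform index;
  public static ElementUniform<T> Distribution(params T[] items) =>
    new ElementUniform<T>(items);
  private ElementUniform(T[] items)
  {
    // (Error handling for an empty array omitted.)
    this.items = items;
    // Choose an index uniformly; duplicated elements get proportionally more weight.
    this.index = StandardDiscreteUniform.Distribution(0, items.Length - 1);
  }
  public T Sample() => items[index.Sample()];
  public IEnumerable<T> Support() => items.Distinct();
  public int Weight(T t) => items.Count(x => x.Equals(t));
}

The caller then says something like ElementUniform<string>.Distribution("rock", "paper", "scissors").Sample() and never touches an index at all.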

But we are getting ahead of ourselves; we’ll get to that in an upcoming episode.

Previously I made a little “histogram string” helper extension method that is useful for quickly displaying results in a debugger or console; let’s do the same for discrete distributions. Moreover, to make sure that we really are getting the results we expect, I’m going to implement it by sampling rather than just basing it on the support and the corresponding weights:

public static string Histogram<T>(
  this IDiscreteDistribution<T> d) =>
  d.Samples().DiscreteHistogram();

public static string DiscreteHistogram<T>(
  this IEnumerable<T> d)
{
  const int sampleCount = 100000;
  const int width = 40;
  var dict = d.Take(sampleCount)
    .GroupBy(x => x)
    .ToDictionary(g => g.Key, g => g.Count());
  int labelMax = dict.Keys
    .Select(x => x.ToString().Length)
    .Max();
  var sup = dict.Keys.OrderBy(x => x).ToList();
  int max = dict.Values.Max();
  double scale = max < width ? 1.0 : ((double)width) / max;
  return string.Join("\n",
    sup.Select(s => $"{ToLabel(s)}|{Bar(s)}"));
  string ToLabel(T t) =>
    t.ToString().PadLeft(labelMax);
  string Bar(T t) =>
    new string('*', (int)(dict[t] * scale));
}

I gotta say, I am loving local functions in C# 7.

Let’s take it for a test drive and roll 1d10:

StandardDiscreteUniform.Distribution(1, 10).Histogram()

The histogram appears as we’d expect:

 1|***************************************
 2|***************************************
 3|***************************************
 4|**************************************
 5|***************************************
 6|****************************************
 7|**************************************
 8|***************************************
 9|***************************************
10|**************************************

Of course, sometimes we will want to see the weights directly:

public static string ShowWeights<T>(
  this IDiscreteDistribution<T> d)
{
  int labelMax = d.Support()
    .Select(x => x.ToString().Length)
    .Max();
  return string.Join("\n", d.Support().Select(s =>
    $"{ToLabel(s)}:{d.Weight(s)}"));
  string ToLabel(T t) =>
    t.ToString().PadLeft(labelMax);
}
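
For example, every member of the support of our 1d6 distribution has weight one, so

StandardDiscreteUniform.Distribution(1, 6).ShowWeights()

produces:

1:1
2:1
3:1
4:1
5:1
6:1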

We seem to be doing pretty well here; our implementations of each probability distribution are short and sweet, and each one provides a stone in the foundation for building more complex distributions.


Next time on FAIC: we’ll build more simple discrete distributions.


13 thoughts on “Fixing Random, part 4”

  1. I realize this series is about API design moreso than the implementation details, but if I can nitpick a bit anyway: doing SupportLength * Sample52Bits() / 2^52 and truncating to an integer is generally going to be very slightly biased, right? Have you considered doing a rejection sampling approach similar to what you did in the implementation of NextDouble()?

    • I’m not seeing the SupportLength * Sample52Bits() / 2^52 bit in the article so maybe he changed it, but the rejection approach wasn’t actually required in NextDouble() so it’s not required here either. Sample52Bits() / 2^52 produces a double uniformly drawn from [0,1) with as much precision as we can get in a double.

      Also, (int)((1.0 + Max – Min) * Sample()) + Min does a fine job of choosing a random integer from [Min, Max]. If we take r = Sample(), c = Max – Min, and d = c + 1, then r*d is a random double from [0, d) and truncating to int produces [0, c]. As for bias, you can consider the first and last values. The first value, 0, is produced by (int)(r*d) if 0 <= r < 1/d. So the real number space that maps to zero has a “width” of 1/d – 0 = 1/d. Now, the last value, c, is produced from (int)(r*d) when c/d <= r < d/d. (If r < c/d then r*d < c, and if r were at least d/d, i.e. (c+1)/d, then r*d would be >= c+1.) So the “width” of the real number space mapping to c is d/d – c/d = (c+1)/d – c/d = 1/d. You can check all the other values similarly, and they have equal real number spaces assigned to them. So if double does a good job of approximating real numbers then it should be unbiased.

      That said, some minor bias will exist whenever d is not a power of two since you’ll get an uneven number of mantissa values mapping to each integer value, but I believe that bias will be “smeared” across the output space. Anyway, since the output space is 2^32 at most and the mantissa has 2^52 random bits, the bias is roughly 1 in 2^(52-32) = 1 in 2^20 at most, which is very small. Using this method to generate random longs, however, would allow the bias to become very significant with large ranges. You could then switch to a larger floating-point type, like a double-double giving an effective 105 random bits (IIRC).

      If you want truly unbiased output then I guess you’d have to generate an integer with ceiling(log2(d)) random bits and reject any result >= d… but it’s so ugly. 🙂
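
      Something like this rough sketch, say, where RandomBit() is a stand-in for whatever source of fair bits you have:

      static int UniformBelow(int d)
      {
        int bits = (int)Math.Ceiling(Math.Log(d, 2));
        while (true)
        {
          // Assemble ceiling(log2(d)) random bits and reject out-of-range results.
          int candidate = 0;
          for (int i = 0; i < bits; i += 1)
            candidate = (candidate << 1) | RandomBit();
          if (candidate < d)
            return candidate;
        }
      }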

      • It wasn’t a literal quote from the source, but it’s what SCU.Distribution.Sample() *
        (1.0 + Max – Min) boils down to since Sample() calls down to NextDouble() which is implemented by taking a bunch of random bits and dividing by 2^52. Sorry if that was unclear.

        I don’t follow your reasoning about biases. Imagine if the floating point was 3 bits in length. Then a 1d6 sample would essentially be generated by taking a uniformly random 3-bit unsigned integer, multiplying by 0.75, rounding down, and adding 1. It so happens in this case that rolling 1 or 4 becomes twice as likely as any other number.
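
        (To spell that out: the eight equally likely samples are k/8 for k = 0 through 7, so floor(0.75 * k) + 1 gives 1, 1, 2, 3, 4, 4, 5, 6; two of the eight bit patterns land on 1 and two land on 4.)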

        Obviously this effect will be way smaller when using 52 bit samples, but in principle the bias still exists. By the pigeonhole principle if the length of the range Max + 1 – Min does not divide 2^52 then some results in the [Min, Max] range will be more likely to be generated than others.

        I considered it a nitpick here because it’s not the primary focus of the blog series as I understood it, but if I were implementing an RNG API for general purpose use I would personally not find it acceptable to introduce this bias here, however small it may be in most cases.

  2. There is another silly problem with “lower bound inclusive, upper bound exclusive” API: you can’t get the uniform distribution on the [MinValue; MaxValue] interval. I suspect that’s the reason why whenever I had seen “unsigned char genRandomByte(unsigned char low = 0, unsigned char high = 255);” subroutine, it always had the “upper bound inclusive” semantics. Of course, the implementations that reduce a random integer modulo (high – low + 1) have to check for possible overflow in order to not divide by zero but that’s expected since the only way to perform division by (MaxValue + 1) is to explicitly do nothing.

  3. Hi, and thanks for your shared work.
    I found your C# script (https://blogs.msdn.microsoft.com/ericlippert/2005/04/15/desafinado-part-four-rolling-your-own-wav-files/) for making RIFF wave sound files.
    I’m not a specialist in wave sound.
    I need to transform a text file into a wav sound, and I don’t know how to do it properly!
    Can you help me? Perhaps you can do that in little time.
    I need, for example, to take alphabetic characters and convert each one to a frequency.
    At this time I can read a text file, get its UTF-8 code, and write it to the file.
    But it’s not correct.
    Can you tell me how to map the letters [a-zA-Z] to known frequencies?
    Thanks a lot for your help.
    Sorry to post here

  4. StandardDiscreteUniform doesn’t seem to work for negative minimum values, since the double to integer conversion in Sample rounds towards zero, giving a zero probability of drawing Min and double the normal probability of drawing zero.

    Also, DiscreteHistogram prints negative values last, which doesn’t look as nice as it could. Is there any reason it sorts by ToLabel instead of using the default comparer for T?

        • I agree that the label sorting could be better, but I’m not following why you think the discrete distribution is wrong when Min is negative. Suppose Min is -1 and Max is 8, so we should get a distribution of equally many of -1, 0, 1, 2, 3, 4, 5, 6, 7, 8. We start by computing 1.0 + Max – Min, which is 10.0. We sample a random number on the half-open interval [0, 1.0), multiply that by 10.0, so we get a random number on [0.0, 10.0). We round that down to an integer, giving us 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, and add Min to that, giving -1, 0, 1, 2, 3, 4, 5, 6, 7, or 8. Where’s the bug?

          • The bug is in that the code adds Min first, giving [-1.0, 9.0), then rounds that towards zero, giving 0, 0, 1, 2, 3, 4, 5, 6, 7, or 8.
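
            Presumably the fix is just to truncate first and then add Min, something like:

            public int Sample() =>
              (int)(SCU.Distribution.Sample() * (1.0 + Max - Min)) + Min;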

          • I have created a pull request for your GitHub repository with a fix for the problem. You can look at a histogram of StandardDiscreteUniform.Distribution(-3, 3) with and without the fix to verify.

