Fixing Random, part 3

[Code is here.]


Last time on FAIC I described a first attempt at how I’d like to fix System.Random:

  • Make every method static and threadsafe
  • Draw a clear distinction between crypto-strength and pseudo-random methods

There were lots of good comments on ways to improve the API and implementation details further. However, even with those improvements, we’re still really just implementing slightly better versions of rand() from the 1970s. We can do better than that.

Let’s start by taking a step back and asking what we really want when we’re calling NextInt or NextDouble or whatever on a source of randomness. What we’re saying is: we have a source of random Ts for some type T, and the values produced will correspond to some probability distribution. We can express that very clearly in the C# type system:

public interface IDistribution<T>
{
  T Sample();
}

For this episode I’m just going to look at continuous probability distributions, which I’ll represent as IDistribution<double>. We’ll look at integer distributions and other possible values for type parameter T in the upcoming episodes.

Now that we have an interface, what’s the simplest possible implementation? We have a library that produces a uniform distribution of doubles over the interval [0.0, 1.0), which is called the standard continuous uniform distribution. Let’s make a little singleton class for that:

using SCU = StandardContinuousUniform;
public sealed class StandardContinuousUniform :
  IDistribution<double>
{
  public static readonly StandardContinuousUniform
    Distribution = new StandardContinuousUniform();
  private StandardContinuousUniform() { }
  public double Sample() => Pseudorandom.NextDouble();
}


Aside: For the purposes of this series I’m going to use pseudo-random numbers; feel free to sub in crypto-strength random numbers if you prefer.

Exercise: make this a non-singleton that takes a source of randomness as a parameter, and feel all fancy because now you’re doing “dependency injection”.
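
One possible shape for that exercise (this is my own sketch, not from the post; the delegate-typed constructor parameter is just one reasonable way to inject the source of randomness):

public sealed class ContinuousUniform : IDistribution<double>
{
  // The caller supplies the source of randomness; any function
  // producing doubles in [0.0, 1.0) will do, for example
  // () => Pseudorandom.NextDouble() or a crypto-strength equivalent.
  private readonly Func<double> source;
  public ContinuousUniform(Func<double> source)
  {
    this.source = source;
  }
  public double Sample() => source();
}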


This implementation might seem trivial — and it is — but upon this simple foundation we can build a powerful abstraction.

Now that we have a type that represents distributions, I’d like to build a bridge between distributions and LINQ. The connection between probability distributions and sequences should be pretty obvious; a distribution is a thing that can produce an infinite sequence of values drawn from that distribution.

I’ll make a static class for my helper methods:

public static class Distribution
{
  public static IEnumerable<T> Samples<T>(
    this IDistribution<T> d)
  {
    while (true)
      yield return d.Sample();
  }

What does this give us? We instantly have access to all the power of the LINQ sequence operators. Want the sum of twelve random doubles drawn from a uniform distribution? Then just say so:

SCU.Distribution.Samples().Take(12).Sum()
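
As another quick illustration (mine, not from the post), the same pipeline gives a one-line sanity check that the sample mean of the standard continuous uniform distribution lands near 0.5:

double mean = SCU.Distribution.Samples().Take(100000).Average();
Console.WriteLine(mean); // prints a value close to 0.5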


Aside: You might be thinking “if there is such a strong connection between distributions and infinite sequences then why not cut out the middleman by making IDistribution extend the IEnumerable interface directly, rather than requiring a Samples() helper method?” That’s an excellent question; the answer will become clear in a few episodes.


For testing and pedagogic purposes I’d like to be able to very quickly visualize a distribution from the console or in the debugger, so here are a couple more extension methods:

public static string Histogram(
  this IDistribution<double> d, double low, double high) =>
  d.Samples().Histogram(low, high);

public static string Histogram(
  this IEnumerable<double> d, double low, double high)
{
  const int width = 40;
  const int height = 20;
  const int sampleCount = 100000;
  int[] buckets = new int[width];
  foreach (double c in d.Take(sampleCount))
  {
    int bucket = (int)(buckets.Length * (c - low) / (high - low));
    if (0 <= bucket && bucket < buckets.Length)
      buckets[bucket] += 1;
  }
  int max = buckets.Max();
  double scale =
    max < height ? 1.0 : ((double)height) / max;
  return string.Join("",
    Enumerable.Range(0, height).Select(
      r => string.Join("", buckets.Select(
        b => b * scale > (height - r) ? '*' : ' ')) + "\n"))
    + new string('-', width) + "\n";
}


Exercise: Rescue the above princess (that is, implement the histogram above as a single LINQ expression) without getting Jon to do it for you. (I haven’t actually tried this; I’d be interested to know if it is possible. 🙂 )


We can now do this:

SCU.Distribution.Histogram(0, 1)

and get the string:

**** ********** *    *   * * *******   *
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
****************************************
----------------------------------------

Which is obviously the flat histogram we expect from a uniform distribution, so I’m pretty pleased.

Extension methods and fluent programming are nice and all, but what else does this get us? Well, we can very concisely produce derived probability distributions. For example, suppose we want the normal distribution:

using static System.Math;
using SCU = StandardContinuousUniform;

public sealed class Normal : IDistribution<double>
{
  public double Mean { get; }
  public double Sigma { get; }
  public double μ => Mean;
  public double σ => Sigma;
  public readonly static Normal Standard = Distribution(0, 1);
  public static Normal Distribution(
      double mean, double sigma) =>
    new Normal(mean, sigma);
  private Normal(double mean, double sigma)
  {
    this.Mean = mean;
    this.Sigma = sigma;
  }
  // Box-Muller method
  private double StandardSample() =>
    Sqrt(-2.0 * Log(SCU.Distribution.Sample())) *
      Cos(2.0 * PI * SCU.Distribution.Sample());
  public double Sample() => μ + σ * StandardSample();
}

And of course we can graph it:

Normal.Distribution(1.0, 1.5).Histogram(-4, 4)

                        ***             
                      ******            
                     ********           
                    **********          
                    ***********         
                   ************         
                  **************        
                  **************        
                 ****************       
                ******************      
                ******************      
               ********************     
              **********************    
             ************************   
             ************************   
           **************************** 
          ******************************
         *******************************
       *********************************
----------------------------------------

If you want a list with a hundred thousand normally distributed values with a particular mean and standard deviation, just ask for it:

Normal.Distribution(1.0, 1.5).Samples().Take(100000).ToList()

Hopefully you can see how much more useful, readable and powerful this approach is, compared to messing around with System.Random directly. We want to move the level of abstraction of the code away from the mechanistic details of loops and lists and whatnot; the code should be at the level of the business domain: distributions and population samples.


Exercise: Try implementing the Irwin-Hall distribution (hint: it’s already written in this episode!)
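
If you want a spoiler for that exercise, here is one possible sketch of mine, following the pattern of the Normal class above and reusing the SCU alias and the Samples() extension method; the Irwin-Hall distribution with parameter n is just the sum of n standard continuous uniform samples:

public sealed class IrwinHall : IDistribution<double>
{
  public int N { get; }
  public static IrwinHall Distribution(int n) => new IrwinHall(n);
  private IrwinHall(int n) { this.N = n; }
  // Sum of N independent standard continuous uniform samples.
  public double Sample() =>
    SCU.Distribution.Samples().Take(N).Sum();
}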

Exercise: Try implementing the Gamma distribution using the Ahrens-Dieter algorithm.
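
Here is a rough sketch of mine for that one, not the post’s solution: it assumes the same using static System.Math and SCU alias as the Normal class, handles the integer part of the shape as a sum of exponentials, uses the Ahrens-Dieter acceptance-rejection step for the fractional part, and ignores edge cases such as a uniform sample of exactly zero:

public sealed class Gamma : IDistribution<double>
{
  public double Shape { get; }
  public double Scale { get; }
  public static Gamma Distribution(double shape, double scale) =>
    new Gamma(shape, scale);
  private Gamma(double shape, double scale)
  {
    this.Shape = shape;
    this.Scale = scale;
  }
  public double Sample()
  {
    // Integer part of the shape: a sum of Exp(1) draws.
    int n = (int)Floor(Shape);
    double delta = Shape - n;
    double sum = 0.0;
    for (int i = 0; i < n; i += 1)
      sum -= Log(SCU.Distribution.Sample());
    // Fractional part of the shape: Ahrens-Dieter
    // acceptance-rejection for Gamma(delta, 1) with 0 < delta < 1.
    double xi = 0.0;
    if (delta > 0)
    {
      while (true)
      {
        double u = SCU.Distribution.Sample();
        double v = SCU.Distribution.Sample();
        double w = SCU.Distribution.Sample();
        double eta;
        if (u <= E / (E + delta))
        {
          xi = Pow(v, 1.0 / delta);
          eta = w * Pow(xi, delta - 1.0);
        }
        else
        {
          xi = 1.0 - Log(v);
          eta = w * Exp(-xi);
        }
        if (eta <= Pow(xi, delta - 1.0) * Exp(-xi))
          break;
      }
    }
    return Scale * (sum + xi);
  }
}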

Exercise: Try implementing the Poisson distribution as an IDistribution<int>.
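
And a sketch of mine for the Poisson exercise, using Knuth’s classic algorithm; it assumes using static System.Math and the SCU alias, and is only suitable for modest values of lambda, since e^(-lambda) underflows when lambda is large:

public sealed class Poisson : IDistribution<int>
{
  public double Lambda { get; }
  public static Poisson Distribution(double lambda) =>
    new Poisson(lambda);
  private Poisson(double lambda) { this.Lambda = lambda; }
  // Multiply uniform samples together until the running product
  // drops below e^(-lambda); the number of extra factors needed
  // is Poisson-distributed with mean lambda.
  public int Sample()
  {
    double limit = Exp(-Lambda);
    double product = SCU.Distribution.Sample();
    int count = 0;
    while (product > limit)
    {
      count += 1;
      product *= SCU.Distribution.Sample();
    }
    return count;
  }
}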


In the next few episodes of FAIC: we’ll stop looking at continuous distributions and take a deeper look at the use cases for our new interface when the type argument represents a small set of discrete values, like Booleans, small ints or enums.

10 thoughts on “Fixing Random, part 3”

  1. Here’s Histogram in a single LINQ Expression. In the spirit of the exercise I switched from String.Join to string concatenation in Aggregate, although this would be horribly inefficient. There’s also a nasty hack in the middle to get the max value used for scaling later in the expression.

    public static string Histogram(
        this IDistribution<double> d, 
        double low, 
        double high)
    {
        const int width = 40;
        const int height = 20;
        const int sampleCount = 100_000;
        int max = 0;

        return Enumerable
            .Range(0, width)
            .GroupJoin(
                d.Samples().Take(sampleCount), 
                x => x, 
                s => (int)(width * (s - low) / (high - low)),
                (x, sg) => new { Index = x, Height = sg.Count() })
            .Select(b => new 
                { 
                    b.Index, 
                    Height = max < b.Height ? (max = b.Height) : b.Height 
                })
            .ToList()
            .Select(b => new 
                { 
                    b.Index, 
                    Height = b.Height * (max < height ? 1.0 : ((double)height) / max) 
                })
            .SelectMany(b => 
                Enumerable
                    .Range(0, height)
                    .Select(r => new 
                        { 
                            Row = r, 
                            Bucket = b.Index, 
                            Char = b.Height > (height - r) ? '*' : ' ' 
                        }))
            .GroupBy(
                x => x.Row, 
                (r, g) => new 
                    { 
                        Row = r, 
                        Text = g.OrderBy(x => x.Bucket)
                                .Aggregate("", (a, x) => a + x.Char) 
                    })
            .OrderBy(x => x.Row)
            .Aggregate("", (a, x) => a + x.Text + "\n")
            + new string('-', width) + "\n";
    }
    
  2. Another attempt to fix the formatting:

    public static string Histogram1(
        this IDistribution<double> d, 
        double low, 
        double high)
    {
        const int width = 40;
        const int height = 20;
        const int sampleCount = 100_000;
        int max = 0;
        
        return Enumerable
            .Range(0, width)
            .GroupJoin(
                d.Samples().Take(sampleCount), 
                x => x, 
                s => (int)(width * (s - low) / (high - low)),
                (x, sg) => new { Index = x, Height = sg.Count() })
            .Select(b => new 
                { 
                    b.Index, 
                    Height = max < b.Height ? (max = b.Height) : b.Height 
                })
            .ToList()
            .Select(b => new 
                { 
                    b.Index, 
                    Height = b.Height * (max < height ? 1.0 : ((double)height) / max) 
                })
            .SelectMany(b => 
                Enumerable
                    .Range(0, height)
                    .Select(r => new 
                        { 
                            Row = r, 
                            Bucket = b.Index, 
                            Char = b.Height > (height - r) ? '*' : ' ' 
                        }))
            .GroupBy(
                x => x.Row, 
                (r, g) => new 
                    { 
                        Row = r, 
                        Text = g.OrderBy(x => x.Bucket)
                                .Aggregate("", (a, x) => a + x.Char) 
                    })
            .OrderBy(x => x.Row)
            .Aggregate("", (a, x) => a + x.Text + "\n")
            + new string('-', width) + "\n";
    }
    
  3. Pingback: Dew Drop – February 8, 2019 (#2895) | Morning Dew

  4. Curious: it doesn’t exactly come up often, but seeing an enumerable, unbounded list strikes me as a little dangerous. I can easily see a naive developer writing:
    ~~~
    foreach(var sample in Distribution.Samples()) { … }
    ~~~

    And being quite distressed when their application runs until the heat death of the universe. Is this just a shortcut to facilitate your thought experiments, or do you think app developers should generally be careful when all they know is that they’re getting an IEnumerable back?

    • I think it’s a (minor) design problem with enumerables, but not one I would stress out about too much. “Code runs forever” bugs tend to get caught in testing pretty quickly! (This came up a lot when I was at Coverity; we wanted to focus our efforts on finding bugs that do *not* get detected the first time you run the code; the supposition of the analyzer is that your tests failed to find the defect, and that the tests executed the code.)

      I seem to recall a few years ago one of the designers of Scala did a talk on similar problems with Scala collections, and concluded that we ought to have better expressed in the type system when a collection is eager / lazy, finite / infinite, mutable / immutable, and so on. I’ll see if I can find it.

      • I generally agree; but still think we could make it a “pit of success” (which is the whole point of this “rant” on Random).

        e.g.:
        public static IEnumerable<T> Samples<T>(
            this IDistribution<T> d,
            int count)
        {
            return Enumerable.Range(1, count)
                .Select(_ => d.Sample());
        }

        • The problem with that well-intentioned solution is that it now works against other features. With your proposal, how do you write “roll a die until you get a six; sum the results”? With my proposal that’s Die().Samples().TakeWhile(i => i != 6).Sum(). What does it look like with your proposed implementation of Samples()? Should we make a *second* Samples that takes a predicate?

    • Learning Racket (a derivative of Scheme) gave me a strong appreciation of lazy evaluation and infinite lists, so seeing them appear in C# is delightful. I think the best solution to the concern here is just documenting in the XML comments for the method that it returns an infinite enumerable. Like with other things, the power it offers outweighs concerns about incorrect usage.

      And, like Eric said, infinite loops tend to be one of those things you find pretty quickly 🙂

  5. Pingback: Fixing Random, part 4 | Fabulous adventures in coding
