Benchmarking mistakes, part three

NOTE: I wrote this series of articles for a web site which went out of business soon thereafter. They asked for a beginner-level series on how to benchmark C# programs. I’ve copied the articles as I wrote them below. Part two is here and part four is here.

As I discussed in part one of this series, a C# program actually goes through two compilers before the code runs. First the C# compiler converts the code into “Common Intermediate Language”, or “IL”, which is written into the assembly — the .EXE or .DLL file of the program. Then at runtime, immediately before a method is called for the first time, the aptly-named “just in time” compiler — the jitter, for short — transforms the IL into actual executable machine code. Because computer programmers love to verb words, this process is called “jitting a method”.

The jitter was built to be fast, but it’s not infinitely fast. You should expect that the first time you run a method it will be considerably slower; before it can run for the first time all its IL has to be translated into machine code. And of course, if the methods that it calls in turn are all being called for the first time then they’ve got to be jitted as well.

But wait, it gets worse. Suppose you are calling a method for the first time, and it in turn creates an object from a library. If that library is being used for the first time in this process, not only does the code for the constructor have to be jitted, but before it can be jitted the library has to be found on disk, mapped into virtual memory, analyzed for security problems, and any code in the static constructors of the newly-created object has to be jitted and executed.

In short, calling code for the first time can be massively more expensive than calling it for the second time. So this brings us to…

Mistake #6: Treat the first run as nothing special when measuring average performance.

In order to get a good result out of a benchmark test in a world with potentially expensive startup costs due to jitting code, loading libraries and calling static constructors, you’ve got to apply some careful thought about what you’re actually measuring.

If, for example, you are benchmarking for the specific purpose of analyzing startup costs then you’re going to want to make sure that you measure only the first run. If on the other hand you are benchmarking part of a service that is going to be running millions of times over many days and you wish to know the average time that will be taken in a typical usage then the high cost of the first run is irrelevant and therefore shouldn’t be part of the average. Whether you include the first run in your timings or not is up to you; my point is, you need to be cognizant of the fact that the first run has potentially very different costs than the second.

Let’s illustrate the enormous effect that the jitter can have on the first run of a program by modifying our example from last time. Let’s suppose we’ve written our own implementation of the standard “quicksort” algorithm and we want to benchmark its performance:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class Extensions
{
  public static void Quicksort(this int[] items)
  {
    Quicksort(items, 0, items.Length - 1);
  } 
  static void Quicksort(int[] items, int leftBound, int rightBound)
  {
    int left = leftBound;
    int right = rightBound;
    int pivotIndex = leftBound + (rightBound - leftBound) / 2;
    int pivot = items[pivotIndex]; 
    while (left <= right)
    {
      // Find an item greater than the pivot
      while (items[left] < pivot)
        left += 1;
      // Find an item less than the pivot
      while (pivot < items[right])
        right -= 1;

      // Swap them
      if (left <= right)
      {
        Swap(ref items[left], ref items[right]);
        left += 1;
        right -= 1;
      }
    }

    // Everything less than the pivot is now left of the pivot.
    // Sort this portion.
    if (leftBound < right)
      Quicksort(items, leftBound, right);
    // Same for the right.
    if (left < rightBound)
      Quicksort(items, left, rightBound);        
  }
  private static void Swap(ref int x, ref int y)
  {
    int t = x;
    x = y;
    y = t;
  } 
}

That’s a pretty standard implementation of quicksort in C#. Now let’s make some modifications to our benchmark test of last time. Here we’ll make a random list, copy it to an array twice so that we know that both runs will be operating on exactly the same data, and see if we can deduce the cost of jitting those methods the first time they’re called:

class P
{        
  static void Main()
  {
    var original = new List<int>();
    const int size = 1000;
    var random = new Random();
    for(int i = 0; i < size; ++i)
      original.Add(random.Next());

    var arr1 = original.ToArray();
    var arr2 = original.ToArray();

    var stopwatch = new System.Diagnostics.Stopwatch();
    stopwatch.Start();
    arr1.Quicksort();
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed);
    stopwatch.Reset();
    stopwatch.Start();
    arr2.Quicksort();
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed);
  }
}

When I run this on my laptop (again, remembering to compile into release mode, and running without the debugger attached) a typical result is 3500 microseconds for the first run and 700 microseconds for the second run; in other words, the first run here took roughly five times longer than the second run on average. It must have taken the jitter about 2.8 milliseconds on average to find the IL and translate it into efficient machine code.

Of course, that factor is relative, and I chose the array size somewhat arbitrarily. If we were sorting an array with a million elements then the ~3 millisecond jit cost would be a barely-noticeable bump. If we were sorting an array with twenty elements then the jit cost wouldn’t be five times worse; it could be a hundred times worse.

Moreover, it’s important to note that different jitters give different results on different machines and in different versions of the .NET framework. The time taken to jit can vary greatly, as can the amount of optimization generated in the machine code. The jit compilers on the Windows 32 bit desktop, Windows 64 bit desktop, Silverlight running on a Mac, and the “compact” jitter that runs when you have a C# program in XNA on XBOX 360 all have potentially different performance characteristics. That’s…

Mistake #7: Assuming that runtime characteristics in one environment tell you what behavior will be in a different environment.

Run your benchmarks in the actual environment that the code will be running in; use machines that have the same hardware and software that will typically be used by the customers who ultimately will run the code.

Next time in this series we’ll take a look at how the garbage collector can affect performance benchmarks.

9 thoughts on “Benchmarking mistakes, part three”

Douglas on July 9, 2013 at 9:00 am said:

Mistake #7?

Using a random number generator in benchmarking with a random seed.

When testing two sets of code, or making subsequent test runs on the same code, using the same set of “random” values each time is desirable so that both runs process the same values. You can do this by specifying the seed to the RNG.

Reply ↓
- Eric Lippert on July 9, 2013 at 9:05 am said:
  
  You have anticipated part five. 🙂
  
  Reply ↓
  - Michael Starberg on July 10, 2013 at 12:06 pm said:
    
    “Not knowing the upper bound” is another mistake. But also a feature for you, Eric. – Oh, I’ll cover that in in part n+m. =)
    
    Reply ↓
- John Payson on July 10, 2013 at 11:50 am said:
  
  Depending upon the algorithm, I would think using a random seed would sometimes be a good thing and sometimes not. If a given sequence of random numbers will end up yielding the same general sequence of operations in the algorithms being tested, using a fixed seed may be good. On the other hand, if one is trying to compare the relative performance of e.g. two very different sorting algorithms, it would be possible that a fixed seed might consistently yield a sequence that usually favors one or disfavors the other. A random seed sometimes yield such sequences, of course, but a fixed seed that did so consistently would pose a bigger problem. I would suggest that even if one uses random seeds, one logs the values that are actually used to allow particular results to be reproduced (e.g. if one run of a program is excessively slow, retrying it with the same seed would reveal if it was an algorithmic problem or a system hiccup).
  
  Reply ↓
  - Chris B on July 10, 2013 at 1:20 pm said:
    
    Sending random inputs to the algorithm is a type of fuzzing. Although it’s usually applied in security testing, it could apply quite well in performance testing (although I’ve never heard anyone mention it in that context). In both cases, the goal is to eliminate the biases that can be present in hand-written tests.
    
    Reply ↓
    - Brian on July 10, 2013 at 1:51 pm said:
      
      Normally, when doing such testing, you will still record the seed. If you are using truly random data (e.g., from random.org), you could store the associated generated entropy, or wrap your random generator and have the ability to record and replay random numbers.
      
      Reply ↓
Kyle Szklenski on July 11, 2013 at 10:26 am said:

“Because computer programmers love to verb words”…

So meta, man. So meta. I love it. You verbed us good!

Reply ↓
Pingback: Benchmarking mistakes, part two | Fabulous adventures in coding
Pingback: Benchmarking mistakes, part four | Fabulous adventures in coding