ATBG: Why do enumerators avoid a bounds check?

I am back from a week in which I visited England, Holland, Belgium, France and Hungary; this is by far the most countries I’ve been to in so little time. It was great to meet so many customers; more on bicycling around Budapest in a future episode. For today though, I’ve posted on the Coverity Development Testing Blog’s continuing series Ask The Bug Guys a follow-up to the previous episode. I’ll discuss why the struct-based list enumerator does not do bounds checking in its implementation of the Current property. Check it out! Thanks to reader Xi Chen for the question.

Benchmarking mistakes, part four

NOTE: I wrote this series of articles for a web site which went out of business soon thereafter. They asked for a beginner-level series on how to benchmark C# programs. I’ve copied the articles as I wrote them below. Part three is here.

This was supposed to be a five-part series; when I supplied part four, I asked when they would like part five. The answer I got back was something to the effect of “your editor X is no longer with the company; Y will be your new editor and will get back to you with a schedule for part five soon”. That was the last I heard from them before they went out of business. So, sorry, part five never got written. More on that below.


Last time in this series we talked about how forgetting to “charge” the expense of the jitter to your analysis can lead to inaccurate results: since the first run of a method incurs the cost of the jitter, you might want to leave that out of your average. Or you might want to leave it in! It depends on your scenario. But regardless, it is wise to be aware that there is a cost there, and that it needs to be taken into account when reasoning about measurements.

The same is true of garbage collection, which is the subject of today’s episode of this tutorial. So without further ado:

Mistake #8: Forgetting to take garbage collection effects into account.

Before we get into how to mitigate the situation, a brief explanation of how the .NET garbage collector works is in order. The explanation which follows is somewhat simplified but gets the necessary ideas across.

Imagine a large block of memory, far larger than the size of any typical C# object. Roughly halfway through that block of memory is a memory location called the “high water mark”. Everything below the high water mark is the storage for the fields of objects and the elements of arrays. Above the high water mark, the remainder of the memory block is empty, just thousands and thousands of zero bits. This is the managed heap.

When you allocate a new array, or a new object of reference type, or box an instance of a value type, the storage for the array elements, object fields or boxed value has to go somewhere. The high water mark is moved up by the required number of bytes and the storage is passed back to the user to be initialized with the array elements, or to be filled in by the constructor, or to have the value copied in.
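
If it helps to see the bump-allocation idea in code, here is a deliberately tiny toy model of the scheme described above. It is purely illustrative and bears no resemblance to the real CLR allocator; the class and its members are my own invention.

// A toy model of "high water mark" allocation, for illustration only.
class ToyHeap
{
  private readonly byte[] block = new byte[1024 * 1024]; // the "managed heap"
  private int highWaterMark = 0; // everything below this is allocated storage

  // Returns the offset of the newly allocated storage within the block.
  public int Allocate(int bytes)
  {
    if (highWaterMark + bytes > block.Length)
      throw new OutOfMemoryException("The high water mark hit the top of the block; time for a collection.");
    int address = highWaterMark;
    highWaterMark += bytes; // bump the high water mark; the new storage is already all zeros
    return address;
  }
}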

This can’t go on forever of course. What happens when the high water mark reaches the top of the block?

This triggers a collection. All the threads in the program are temporarily frozen, because they might be using memory that the garbage collector is about to mess with. (In some versions of the CLR it is possible to continue running code while the GC is running, but we will not discuss this advanced scenario.) The garbage collector, which runs on a thread of its own, is activated. It sets a bit on every object in the memory block that says “this object is dead”. It knows what objects are “roots”; that is, what objects are known definitely to be in use by the program at this time. It clears the bit on those objects, and then recursively clears the bit on every object referenced by those objects, and so on. Before long, every object that can be reached by accessing a field or element of a rooted object is marked as alive by its cleared bit; the rest are marked as dead.

Next, any of the dead objects that have finalizers that need to run are marked as being alive and moved onto a special queue. The finalizer queue is a root of the garbage collector, so those objects will stay alive until their finalizers run (on yet another thread) at which time they will be removed from the queue. At that time they will likely be truly dead, and the garbage collector will collect them.

Finally, the dead objects are now “holes” that can be filled in by living objects. The garbage collector moves the living objects near the high water mark into the holes, and thereby lowers the high water mark.

So far what I’ve described is a single generation mark-and-sweep collector. The “desktop” .NET garbage collector is actually more sophisticated than that. It actually has three large blocks of memory, called the generation 0, generation 1 and generation 2 heaps. When generation 0 is collected, the surviving objects are not copied back into the holes; rather, they are copied up to generation 1, and the generation 0 block becomes empty. When the generation 1 heap is collected, again, the survivors are copied up to generation 2.

The idea here is that we are identifying objects that are short-lived: everything that was dead in generation 0 was very short-lived. Objects collected in generation 1 were longer-lived, and objects that make it all the way to generation 2 are extremely long-lived.

Because garbage collection requires tracing the set of living objects, the cost of garbage collection is gated on the number of living objects. By using a generational approach, inexpensive generation 0 collections happen frequently; these are the most likely to produce garbage, and the number of living generation 0 objects is likely to be small. By contrast, long-lived objects that make it to generation 2 are the least likely to be released on any given collection. If there are a lot of them then collection will be expensive and might not actually free up much space. The runtime therefore wants to run a generation 2 collection as infrequently as it can get away with.
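
You can actually watch objects being promoted from generation to generation; here is a small sketch, my own addition rather than part of the original article, that uses GC.GetGeneration to report where a surviving object lives:

using System;

class GenerationDemo
{
  static void Main()
  {
    var obj = new object();
    Console.WriteLine(GC.GetGeneration(obj)); // typically 0: newly allocated
    GC.Collect();                             // obj is still referenced below, so it survives
    Console.WriteLine(GC.GetGeneration(obj)); // typically 1: promoted by the first collection
    GC.Collect();
    Console.WriteLine(GC.GetGeneration(obj)); // typically 2, which is GC.MaxGeneration
  }
}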

So, summing up, the takeaways here are:

  • Garbage collection stops the world while the collection happens.
  • Generation 0 collections are frequent but inexpensive; generation 2 collections are infrequent and likely to be expensive; generation 1 collections are in the middle.
  • Objects with finalizers automatically survive; the finalizers run on a different thread, after the collection happens.

How then does this impact performance benchmarks? Let’s think of a few different scenarios.

Scenario one: Your benchmark produces only short-lived garbage, and the same number of collections are triggered every time you run the benchmark. This is the best situation you can be in, because every time through the benchmark you are measuring the actual cost that would be observed by a user. Unfortunately, things seldom work out so tidily.

Scenario two: Your benchmark produces only short-lived garbage, and moreover, it produces enough short-lived garbage to trigger a collection, let’s say every ten times the benchmark is executed. If you are running the benchmark fewer than ten times, essentially this means that you are never measuring the cost of a collection because you’re never triggering one. If you are running the benchmark a hundred times then roughly every tenth run will be more expensive than the previous nine.

This has both positive and negative effects. On the positive side, you can now compute both the average cost and the maximum typical cost, which is useful data. On the negative side, it can be confusing to see variations like this in your measurements if it is not clear what is going on.
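
One way to take the mystery out of those occasional slow runs is to record how many collections happened during each run. Here is a sketch; the RunBenchmark method is a hypothetical stand-in for whatever code you are measuring.

var stopwatch = new System.Diagnostics.Stopwatch();
int gen0Before = GC.CollectionCount(0);
int gen1Before = GC.CollectionCount(1);
int gen2Before = GC.CollectionCount(2);

stopwatch.Start();
RunBenchmark(); // hypothetical: the code being measured
stopwatch.Stop();

Console.WriteLine("{0} elapsed; collections: gen0={1} gen1={2} gen2={3}",
  stopwatch.Elapsed,
  GC.CollectionCount(0) - gen0Before,
  GC.CollectionCount(1) - gen1Before,
  GC.CollectionCount(2) - gen2Before);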

Scenario three: Now consider scenarios one and two again, but suppose this time that the benchmark produces long-lived garbage that is infrequently collected, and suppose some of the garbage items have finalizers. Again, you might be failing to measure the true costs if there are no generation-2 collections while your benchmarks are running. Also, the finalizer thread is another thread which might be consuming processor time that your benchmark could be using. This is of particular concern if there are more active threads in your program than hardware processors to service them.

Scenario four: Your benchmark is actually testing two different functions, and comparing their costs; benchmarks are often used for comparative testing. Suppose the garbage produced by the first method is collected during the execution of the second method; now we’re making an inaccurate comparison of the two alternatives because one is being charged a cost that was incurred by the other.

Scenario five: Your benchmark mechanisms themselves are producing garbage, causing collections to happen more often than normal during the code that you are benchmarking. Again, you might not be measuring the true cost of a piece of code, because this time the costs of the benchmarking environment are being charged to the code being tested.

In short, failing to take GC costs into account can cause you to fail to measure the true cost of an operation, or cause you to charge that cost to the wrong code.

So how do you mitigate these problems?

What I usually do when benchmarking, particularly if it is comparative benchmarking as described in scenario four, is to force the garbage collector to do a full collection of all three generations before and after every test. This ensures that every test starts on a level playing field.

You might consider measuring the costs of each of those collections as well, to try to get a handle on just how much time you’re spending on every full collection. Remember, full collections are going to be more expensive than the partial collections that are normally performed, so this will give you an upper bound on the cost, not the true cost.

To force the garbage collector to do a full collection you should actually do two steps:

System.GC.Collect();
System.GC.WaitForPendingFinalizers();

The first line causes all three generations to be collected; all threads are stopped and the GC reclaims everything that it can. But that might have caused objects to be placed on the finalizer queue, which is processed on its own thread. In order to make sure we are measuring the cost of finalization, we stop the world again and wait for the finalizers to all run.

If you’ve been paying close attention so far, something should be nagging you at this point. Remember that I said that objects on the finalizer queue are alive until they are finalized, and therefore they live until a collection happens after their finalization. By doing these two steps we are not actually guaranteeing that every dead object has been cleaned up: an object that was dead and in need of finalization before the call to GC.Collect() will still be alive after that call, and then dead after the call to GC.WaitForPendingFinalizers(), but its memory will not be reclaimed yet, because the garbage collector just did a collection and is not going to run again for a while.

Usually the number of objects that survive to the next generation because they need finalization is extremely small, so this situation usually does not produce a measurable impact. However if your benchmarked method produces a huge amount of garbage that needs finalization, you might consider doing a second GC.Collect() afterwards, to ensure that objects that survived only to be finalized are cleaned up. (And if you are producing a huge amount of garbage that needs finalization, there’s probably something that can be improved in your program. But that’s a subject for another day.)
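
Putting that all together, the cleanup between tests might look something like this sketch, where RunTest is a hypothetical stand-in for the code being benchmarked:

// Force a full cleanup so that the next test starts with a quiet heap.
GC.Collect();                  // collect all generations; finalizable objects go onto the queue
GC.WaitForPendingFinalizers(); // wait for the finalizer thread to drain that queue
GC.Collect();                  // reclaim the objects that survived only to be finalized

var stopwatch = System.Diagnostics.Stopwatch.StartNew();
RunTest();                     // hypothetical: the code being benchmarked
stopwatch.Stop();
Console.WriteLine(stopwatch.Elapsed);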

Next time, we’ll finish up this series with a look at ways to make sure that your measurements are repeatable and realistic.


As noted above, part five never got written because the company which commissioned it went out of business. I’m writing this update in 2020, seven years later, so I don’t recall exactly what I was planning to say. But it was something about exploring the fundamental difficulty of achieving both “repeatable” and “realistic” results in a benchmark. For repeatable results you want as “clean” a testing environment as possible: same operating system, same frameworks, same processes running, same disk speed and network speed and so on. But in realistic customer scenarios, none of that is true! I don’t recall what I was going to suggest you do about it though. It’s a tricky problem.

Benchmarking mistakes, part three

NOTE: I wrote this series of articles for a web site which went out of business soon thereafter. They asked for a beginner-level series on how to benchmark C# programs. I’ve copied the articles as I wrote them below. Part two is here and part four is here.


As I discussed in part one of this series, a C# program actually goes through two compilers before the code runs. First the C# compiler converts the code into “Common Intermediate Language”, or “IL”, which is written into the assembly — the .EXE or .DLL file of the program. Then at runtime, immediately before a method is called for the first time, the aptly-named “just in time” compiler — the jitter, for short — transforms the IL into actual executable machine code. Because computer programmers love to verb words, this process is called “jitting a method”.

The jitter was built to be fast, but it’s not infinitely fast. You should expect that the first time you run a method it will be considerably slower; before it can run for the first time all its IL has to be translated into machine code. And of course, if the methods that it calls in turn are all being called for the first time then they’ve got to be jitted as well.

But wait, it gets worse. Suppose you are calling a method for the first time, and it in turn creates an object from a library. If that library is being used for the first time in this process, not only does the code for the constructor have to be jitted, but before it can be jitted the library has to be found on disk, mapped into virtual memory, analyzed for security problems, and any code in the static constructors of the newly-created object has to be jitted and executed.
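
You can simulate this kind of one-time cost yourself. In the sketch below, which is my own illustration rather than part of the original article, the static constructor stands in for expensive loading and initialization work, and only the first use pays for it:

using System;
using System.Diagnostics;
using System.Threading;

static class ExpensiveLibrary
{
  // The static constructor runs exactly once, immediately before the first use of
  // the class; the Sleep stands in for loading, verification and jitting costs.
  static ExpensiveLibrary() { Thread.Sleep(100); }
  public static int Value = 42;
}

class Program
{
  static void Main()
  {
    var stopwatch = Stopwatch.StartNew();
    int first = ExpensiveLibrary.Value;  // pays the one-time initialization cost
    stopwatch.Stop();
    Console.WriteLine("{0} ({1})", stopwatch.Elapsed, first);

    stopwatch.Restart();
    int second = ExpensiveLibrary.Value; // cheap: the class is already initialized
    stopwatch.Stop();
    Console.WriteLine("{0} ({1})", stopwatch.Elapsed, second);
  }
}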

In short, calling code for the first time can be massively more expensive than calling it for the second time. So this brings us to…

Mistake #6: Treating the first run as nothing special when measuring average performance.

In order to get a good result out of a benchmark test in a world with potentially expensive startup costs due to jitting code, loading libraries and calling static constructors, you’ve got to apply some careful thought about what you’re actually measuring.

If, for example, you are benchmarking for the specific purpose of analyzing startup costs then you’re going to want to make sure that you measure only the first run. If on the other hand you are benchmarking part of a service that is going to be running millions of times over many days and you wish to know the average time that will be taken in a typical usage then the high cost of the first run is irrelevant and therefore shouldn’t be part of the average. Whether you include the first run in your timings or not is up to you; my point is, you need to be cognizant of the fact that the first run has potentially very different costs than the second.
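
If you decide that the first run should not count, the usual approach is a simple warm-up: call the code once before you start timing. A sketch, where RunTest is a hypothetical stand-in for the code under test:

RunTest(); // warm-up run: pays for jitting, library loading and static constructors

var stopwatch = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 100; i += 1)
  RunTest(); // these runs measure only the steady-state cost
stopwatch.Stop();
Console.WriteLine(stopwatch.Elapsed);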

Let’s illustrate the enormous effect that the jitter can have on the first run of a program by modifying our example from last time. Let’s suppose we’ve written our own implementation of the standard “quicksort” algorithm and we want to benchmark its performance:

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

static class Extensions
{
  public static void Quicksort(this int[] items)
  {
    Quicksort(items, 0, items.Length - 1);
  } 
  static void Quicksort(int[] items, int leftBound, int rightBound)
  {
    int left = leftBound;
    int right = rightBound;
    int pivotIndex = leftBound + (rightBound - leftBound) / 2;
    int pivot = items[pivotIndex]; 
    while (left <= right)
    {
      // Find an item greater than the pivot
      while (items[left] < pivot)
        left += 1;
      // Find an item less than the pivot
      while (pivot < items[right])
        right -= 1;

      // Swap them
      if (left <= right)
      {
        Swap(ref items[left], ref items[right]);
        left += 1;
        right -= 1;
      }
    }

    // Everything less than the pivot is now left of the pivot.
    // Sort this portion.
    if (leftBound < right)
      Quicksort(items, leftBound, right);
    // Same for the right.
    if (left < rightBound)
      Quicksort(items, left, rightBound);        
  }
  private static void Swap(ref int x, ref int y)
  {
    int t = x;
    x = y;
    y = t;
  } 
}

That’s a pretty standard implementation of quicksort in C#. Now let’s make some modifications to our benchmark test of last time. Here we’ll make a random list, copy it to an array twice so that we know that both runs will be operating on exactly the same data, and see if we can deduce the cost of jitting those methods the first time they’re called:

class P
{        
  static void Main()
  {
    var original = new List<int>();
    const int size = 1000;
    var random = new Random();
    for(int i = 0; i < size; ++i)
      original.Add(random.Next());

    var arr1 = original.ToArray();
    var arr2 = original.ToArray();

    var stopwatch = new System.Diagnostics.Stopwatch();
    stopwatch.Start();
    arr1.Quicksort();
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed);
    stopwatch.Reset();
    stopwatch.Start();
    arr2.Quicksort();
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed);
  }
}

When I run this on my laptop (again, remembering to compile into release mode, and running without the debugger attached) a typical result is 3500 microseconds for the first run and 700 microseconds for the second run; in other words, the first run here took roughly five times longer than the second run on average. It must have taken the jitter about 2.8 milliseconds on average to find the IL and translate it into efficient machine code.

Of course, that factor is relative, and I chose the array size somewhat arbitrarily. If we were sorting an array with a million elements then the ~3 millisecond jit cost would be a barely-noticeable bump. If we were sorting an array with twenty elements then the jit cost wouldn’t be five times worse; it could be a hundred times worse.

Moreover, it’s important to note that different jitters give different results on different machines and in different versions of the .NET framework. The time taken to jit can vary greatly, as can the amount of optimization applied to the generated machine code. The jit compilers for the 32-bit Windows desktop, the 64-bit Windows desktop, Silverlight running on a Mac, and the “compact” framework used by XNA on the Xbox 360 all have potentially different performance characteristics. That’s…

Mistake #7: Assuming that runtime characteristics in one environment tell you what behavior will be in a different environment.

Run your benchmarks in the actual environment that the code will be running in; use machines that have the same hardware and software that will typically be used by the customers who ultimately will run the code.

Next time in this series we’ll take a look at how the garbage collector can affect performance benchmarks.

Benchmarking mistakes, part two

NOTE: I wrote this series of articles for a web site which went out of business soon thereafter. They asked for a beginner-level series on how to benchmark C# programs. I’ve copied the articles as I wrote them below. Part one is here and part three is here.


So far in this series we’ve learned about the jitter and how it compiles each method in your program “on the fly”; we’ll come back to jitter issues in the next episode. In this episode I want to look at some actual (terrible) code for measuring performance.

Let’s suppose part of a larger program will involve sorting a long list of integers, and we’d like to know how much time that is going to take. Rather than testing the entire “real” program and trying to isolate the specific part, we’ll write a benchmark that just answers the question at hand.

There are a lot of things wrong with the code below that we’ll get to in later episodes; the issue I’m going to focus on today is:

Mistake #5: Using a clock instead of a stopwatch.

using System;
using System.Collections.Generic;
class P
{
  public static void Main()
  {
    var list = new List<int>();
    const int size = 10000;
    var random = new Random();
    for(int i = 0; i < size; i += 1)
      list.Add(random.Next());
    var startTime = DateTime.Now;
    list.Sort();
    var stopTime = DateTime.Now;
    var elapsed = stopTime - startTime;
    Console.WriteLine(elapsed);
  }
}

That looks pretty reasonable, right? We create some random data to test. Let’s assume that ten thousand elements is about the typical problem size that the customer will see. We are careful to time only the actual sort and not include the creation of the random data. If we run it on my laptop I see that it reports about 3 milliseconds. How accurate is this?

That result is off by a factor of more than two from the correct answer; the sort actually takes about 1.2 milliseconds. What explains this enormous discrepancy, and how did I know what the correct answer was?

Well, suppose you were trying to time this program by actually watching a grandfather clock that ticks once per second. This is of course ludicrous because you know what would happen: either the clock would tick during the few milliseconds the program was active, or it wouldn’t. If it did then you’d record one second, if not, then you’d record zero. The resolution of the clock is nowhere near good enough to actually measure the elapsed time with the necessary precision.

DateTime.Now is only slightly better than that grandfather clock: it only ticks every few milliseconds. Worse, the rate at which it ticks is not fixed from machine to machine. On modern Windows machines you can expect that the tick resolution will be around 100 ticks per second or better under normal circumstances, which is a resolution of 10 milliseconds or less. Apparently things went well for me on this laptop and I got a resolution of 3 milliseconds, which is great for DateTime.Now but still nowhere near fast enough to accurately measure the time spent sorting.
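
You can estimate the effective resolution of DateTime.Now on your own machine with a little sketch like this one, my own addition, which waits for a tick boundary and then measures how long it takes the clock to change again:

// Spin until the clock ticks over so that we start exactly on a tick boundary...
DateTime start = DateTime.Now;
DateTime next = start;
while (next == start)
  next = DateTime.Now;
// ...then measure the time until the next tick; that is roughly the resolution.
DateTime afterNext = next;
while (afterNext == next)
  afterNext = DateTime.Now;
Console.WriteLine((afterNext - next).TotalMilliseconds);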

The correct tool to use when writing benchmarks is Stopwatch in the System.Diagnostics namespace. (I note that this namespace is well-named; everything in here is good for diagnosing problems but should probably not be used in “production” code.) Let’s take a look at the code written to use Stopwatch instead: (There is still plenty wrong with this benchmark, but at least we’re improving.)

using System;
using System.Collections.Generic;
using System.Diagnostics;
class P
{
  public static void Main()
  {
    var list = new List<int>();
    const int size = 10000;
    var random = new Random();
    for(int i = 0; i < size; i += 1)
      list.Add(random.Next());
    var stopwatch = new System.Diagnostics.Stopwatch();
    stopwatch.Start();
    list.Sort();
    stopwatch.Stop();
    Console.WriteLine(stopwatch.Elapsed);
  }
}

Notice first of all how much more pleasant it is to use the right tool for the job; the Stopwatch class is designed for exactly this problem so it has all the features you’d want: you can start, pause and stop the timer, computing the elapsed time does not require a subtraction, and so on.

Second, the results now show far more accuracy and precision: this run took 1.1826 milliseconds on my laptop. We have gone from ~10 millisecond precision to sub-microsecond precision!

How precise is the stopwatch, anyway? It depends on what kind of CPU you have; the Stopwatch class actually queries the CPU hardware to get such high precision. The Stopwatch.Frequency read-only field will tell you what the resolution is; on my laptop it is two million ticks per second, which you’ll note is considerably larger than the operating system’s weak promise of around 100 ticks per second for the grandfather clock that is DateTime.Now. This laptop is pretty low-end; better hardware can provide even higher resolution than that. Remember, one two-millionth of a second is still hundreds of processor cycles on modern hardware, so there’s room for improvement.
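
If you are curious what you get on your own hardware, you can simply ask; a quick sketch:

// Ask the Stopwatch class what the underlying timer can do on this machine.
Console.WriteLine(System.Diagnostics.Stopwatch.IsHighResolution); // true when a high-resolution hardware timer is available
Console.WriteLine(System.Diagnostics.Stopwatch.Frequency);        // resolution, in ticks per second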

Suppose for the sake of argument though that the code we were benchmarking was going to run for seconds, minutes or even hours. In those cases the difference between a resolution of 100 ticks per second and 2 million ticks per second is irrelevant, right? If you’re going to be measuring minutes then you don’t need the second hand to be super accurate, so why not use DateTime.Now in those cases?

It’s still a bad idea because Stopwatch has been specifically designed to solve the problem of easily measuring time spent in code; DateTime.Now was designed to solve a completely different problem, namely, to tell you what time it is right now. When you use the wrong tool for the job, you sometimes can get strange results.

For example, suppose you use DateTime.Now to measure the time in a long-running performance benchmark that you run overnight on a machine in Seattle. Say, on Sunday November 3rd, 2013. The result you get is going to be wrong by one hour, and might in fact even be a negative number, because DateTime.Now pays attention to Daylight Saving Time changes.

Just don’t even go there. DateTime.Now is the wrong tool for the job, was designed to solve a different problem, is harder to use than Stopwatch, and has thousands or millions of times less precision. Avoid it entirely when writing benchmarks in C#.

Next time in this series we’ll take a closer look at how the jitter can affect your benchmarks.

Benchmarking mistakes, part one

NOTE: I wrote this series of articles for a web site which went out of business soon thereafter. They asked for a beginner-level series on how to benchmark C# programs. I’ve copied the articles as I wrote them below. Part two is here.


In this series of articles, I’m going to go through some of the mistakes I frequently see people making who are attempting to write benchmarks in C#. But before we get into the mistakes, I suppose I should introduce myself and define the term.

Hi, I’m Eric Lippert; I work at Coverity where I design and implement static analyzers to find bugs in C# programs. Before that I was at Microsoft for 16 years working on the C#, VBScript, JScript and Visual Basic compilers, amongst other things. I write a blog about language design issues at EricLippert.com. Many thanks to the editors here at Tech.pro for inviting me to write this series.

OK, let’s get into it. What exactly do I mean when I say benchmark?

The term comes originally from surveying; a surveyor would mark an object that was at an already-known position and then use that mark to determine the relative and absolute positions of other objects. In computing the term, like so much other jargon, has been liberally borrowed and now means pretty much any sort of performance comparison between two alternatives.

Benchmarking is often used to describe the performance of computer hardware: you write a program, compile and execute it on two different computers, and see which one performs better. That’s not the kind of benchmarking I’m going to talk about in this series; rather, I want to talk about performance benchmark tests for software.

I want to clarify that further, starting with the meaning of “performance” itself. The first rule of software metrics is well known: you get what you measure.

If you reward people for making a measurable improvement in memory usage, don’t be surprised if time performance gets worse, and vice versa. If you reward improvement rather than achievement of a goal then you can expect that people will keep trying to make improvements even after the goal has been achieved (or worse, even if it is never achieved!)

This brings us to our first benchmarking mistake:

Mistake #1: Choosing a bad metric.

If you’ve chosen a bad metric then you’re going to waste a lot of effort measuring and improving an aspect of the software that is not relevant to your users, so choose carefully.

For the rest of this series I’m going to assume that the relevant performance metric that your benchmark measures is average execution time, and not one of the hundreds of potential other metrics, like worst-case time, memory usage, disk usage, network usage, and so on. This is the most common metric for performance benchmarks and hence the one I see the most mistakes in.

I also want to clarify one other thing before we dive in. I’m assuming here that the purpose of the benchmark is to empirically determine the performance of a small part of a larger software project so that an informed decision can be made.

For example, you might have a program that, among its many other tasks, sometimes has to sort a large set of data. The benchmarks I’m talking about in this series are the narrowly targeted tests of, say, half a dozen different sort algorithms to determine which ones yield acceptable performance on typical data; I’m not talking about “end to end” performance testing of the entire application. Often in large software projects the individual parts have good performance in isolation, but bad performance in combination; you’ve got to test both.

That brings us to:

Mistake #2: Over-focusing on subsystem performance at the expense of end-to-end performance.

This series of articles is going to be all about subsystem performance; don’t forget to budget some time for end-to-end testing as well.

So far we’ve seen some very general mistakes; now let’s start to dig into the actual mistakes people make in implementing and executing their subsystem performance benchmarks in C#. The number one most common mistake I see is, no kidding:

Mistake #3: Running your benchmark in the debugger.

This is about the worst thing you can possibly do. The results will be totally unreliable. Think about all the things that are happening when you run a managed program in a debugger that are not happening when your customer runs the program: the CLR is sending information to the debugger about the state of the program, debug output is being displayed, heck, an entire other enormous process is running.

But it gets worse, far worse.

The jit compiler knows that a debugger is attached, and it deliberately de-optimizes the code it generates to make it easier to debug. The garbage collector knows that a debugger is attached; it works with the jit compiler to ensure that memory is cleaned up less aggressively, which can greatly affect performance in some scenarios.

But perhaps I am getting ahead of myself. What is this “jit compiler” thing? In order to make sense of the next episode in this series you’ll need to have a pretty solid understanding of how compilation works in .NET. Here’s the high level view.

Let’s suppose you write some source code in C# using Visual Studio. When you build that project the IDE starts up the C# compiler. A compiler is by definition a program which translates a program written in one language into “the same” program written in another language. The C# compiler translates C# code into a different language, IL, the Intermediate Language. (Also sometimes notated CIL for Common IL or MSIL for Microsoft IL, but we’ll just stick with “IL”.)

IL is a very low-level language designed so that in its compressed binary form it is reasonably compact but also reasonably fast to analyze. A managed assembly (a .exe or .dll file) contains the IL for every method in the project as well as the “metadata” for the project: a compact description of all the classes, structs, enums, delegates, interfaces, fields, properties, methods, events,… and so on in your program.

When you run code in a managed assembly, the Common Language Runtime (CLR) reads the metadata out of the assembly to determine what the types and methods and so on are. But the real miracle of the CLR is the Just In Time compiler — “the jitter” for short. The jitter is a compiler, so again, it translates from one language to another. The CLR runs the jitter on the IL associated with a method immediately before that method is about to run for the first time — hence the name “Just In Time compiler”. It translates the IL into the machine code that will actually execute on the processor.

So now perhaps it is more clear why understanding the jitter behaviour is so important when benchmarking code; the jitter is dynamically generating the actual machine code on the fly, and therefore determining how heavily optimized that machine code is. The jitter knows whether there is a debugger attached or not, and if there is then it figures it had better not be aggressive about optimizations because you might be trying to inspect the code in the debugger; heavily optimized code is harder to understand. But obviously the unoptimized code will be less performant, and therefore the benchmark is ruined.

Even if you don’t run your benchmark program in the debugger it is still important to make sure that you are not telling the jitter to go easy on the optimizations:

Mistake #4: Benchmarking the debug build instead of the release build.

If you compile a project in Visual Studio in “debug” mode then both the C# compiler and the jit compiler will again deliberately generate less-optimized code, even if you run the program outside of the debugger, on the assumption that clarity is better than speed when you are attempting to diagnose problems.

And of course the debug version of your program might contain special-purpose code of your own devising to make debugging easier. For example, expensive assertions might be checked which would be ignored in the release build.
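
Here is a sketch of the kind of debug-only cost that silently vanishes in the release build; it assumes using directives for System.Diagnostics and System.Linq, and a hypothetical items array:

// Debug.Assert is marked [Conditional("DEBUG")], so in a release build the call,
// including the evaluation of its potentially expensive argument, is removed
// entirely by the compiler; only the debug build pays for this check.
Debug.Assert(items.OrderBy(x => x).SequenceEqual(items), "items must already be sorted");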

Both mistakes #3 and #4 are actually specific versions of a more general mistake: testing the code in an environment radically different from the customer’s environment. The customer is not going to be running the debug version of your product, so don’t test that version. We’ll come back to this point in a later episode.

Next time in this series I’ll talk about mistakes made in specific measurement techniques; after that we’ll take a look at some more subtle ways in which forgetting about the jitter can lead to bad benchmarks.

Nullable micro-optimizations, part eight

Today, the answer to the puzzle I posed last week. The answer is no, that optimization is not legal in general, though it is legal in a few specific cases.

As I have discussed several times before, C# has very strict rules about what order operations happen in. A quick rule of thumb is: operands run left-to-right. In our example we had X() * Y() + Z(), so let’s see what conclusions we reach by applying this rule:

  • X() is an operand of the * to the left of Y(), so the call to X() must happen before the call to Y().
  • X()*Y() is an operand of the + to the left of Z(), so the multiplication must happen before the call to Z().

The compiler must generate code that computes X(), then Y(), then the multiplication, then Z(), and then the addition. But our proposed optimization was

int? r;
int? tempX = X();
int? tempY = Y();
int tempZ = Z();
r = tempX.HasValue & tempY.HasValue ?
  new int?(tempX.GetValueOrDefault() * tempY.GetValueOrDefault() + tempZ) :
  new int?();

which computes Z() before the multiplication.

So what’s the big deal? Multiplication of integers can throw an exception in a checked context. If the multiplication of X() and Y() throws, then Z() must never be called at all; but the proposed code calls Z() before the multiplication happens. This optimization is therefore only valid in an unchecked context.
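
Here is a small sketch, my own illustration rather than part of the original post, that makes the ordering requirement concrete. With correct code generation the overflow happens before Z() is ever evaluated, so “Z” is never printed:

using System;

class P
{
  static int? X() { Console.WriteLine("X"); return int.MaxValue; }
  static int? Y() { Console.WriteLine("Y"); return 2; }
  static int? Z() { Console.WriteLine("Z"); return 1; }

  static void Main()
  {
    try
    {
      int? r = checked(X() * Y() + Z());
      Console.WriteLine(r);
    }
    catch (OverflowException)
    {
      // The multiplication overflowed, so Z() must never have been called;
      // the proposed optimization would already have called it.
      Console.WriteLine("overflowed");
    }
  }
}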

And of course it just gets worse if we start to consider lifted arithmetic on types other than int?. It’s a bad practice to write a user-defined operator with a side effect, but it is legal and the C# compiler must ensure that side effects of user-defined operators are observed to happen in the prescribed order.

Rather than try to figure out all the types for which this optimization is legal in various contexts, Roslyn makes no attempt to optimize this kind of binary operator. It generates a temporary int? for the multiplication, and then does the regular lifted arithmetic for the addition. Another lovely optimization spoiled by conflict with a lovely invariant.

But wait!

The problem is that the side effects of the multiplication must be observed to come before the side effects of the right addend. What if the right addend has no side effects? Then we can do this optimization!

The most common situation in which the right addend has no side effects is when it is a constant:

int? r = X() * Y() + 1;

This can legally be optimized to

int? r;
int? tempX = X();
int? tempY = Y();
r = tempX.HasValue & tempY.HasValue ?
  new int?(tempX.GetValueOrDefault() * tempY.GetValueOrDefault() + 1) :
  new int?();

And in fact I did add this optimization to Roslyn; just as unary operations and conversions can be “distributed”, so can binary operations where the right operand is a constant. Moreover, as a pleasant side effect, doing so allowed for an easy implementation of various optimizations that improve the quality of lifted x += 1 and x++ expressions.

Well, that took rather more episodes than I thought it would when I started! I could talk a bit more about how Roslyn optimizes other more exotic nullable arithmetic, like equality, inequality, and logical operations. I could also talk a bit about the stuff I didn’t get to implement before I left; Roslyn does a slightly worse job than the original recipe compiler when optimizing expressions where we know that the result is going to be null, but also must compute a side effect. But rather than go down those lengthy rabbit holes, I think I’ll call this series done at eight episodes.


Next time on FAIC: Some of these optimizations we’ve been talking about are elisions; I’ll talk a bit about how computer programmers have borrowed this term from linguistics, and how we use it in two different senses.

Nullable micro-optimizations, part seven

Today, a puzzle for you.

We’ve been talking about how the Roslyn C# compiler aggressively optimizes nested lifted unary operators and conversions by using a clever technique. The compiler realizes the inner operation as a conditional expression with a non-null nullable value on the consequence branch and a null nullable value on the alternative branch, distributes the outer operation to each branch, and then optimizes the branches independently. That then gives a conditional expression that can itself be the target of further optimizations if the nesting is deeper.

This works great for lifted conversions and unary operators. Does it also work for binary operators? It seems like it would be a lot harder to make this optimization work for a lifted binary operator where both operands are themselves lifted operations. But what if just one of the operands was a lifted operation, and the other operand was guaranteed to be non-null? There might be an opportunity to optimize such an expression. Let’s try it. Suppose X() and Y() are expressions of type int? and that Z() is an expression of type int:

int? r = X() * Y() + Z();

We know from our previous episodes that operator overload resolution is going to choose lifted multiplication for the inner subexpression, and lifted addition for the outer subexpression. We know that the right operand of the lifted addition will be treated as though it was new int?(Z()), but we can optimize away the unnecessary conversion to int?. So the question is can the C# compiler legally code-generate that as though the user had written:

int? r;
int? tempX = X();
int? tempY = Y();
int tempZ = Z();
r = tempX.HasValue & tempY.HasValue ?
  new int?(tempX.GetValueOrDefault() * tempY.GetValueOrDefault() + tempZ) :
  new int?();

If you think the answer is “yes” then the follow-up question is: can the C# compiler legally make such an optimization for all nullable value types that have lifted addition and multiplication operators?

If you think the answer is “no” then the follow-up questions are: why not? and is there any scenario where this sort of optimization is valid?


Next time on FAIC we’ll be kind to our fine feathered friends; after that, we’ll find out the answer to today’s question.


Eric is crazy busy at Coverity’s head office; this posting was pre-recorded.

Nullable micro-optimizations, part six

Previously in this series I said that the original C# compiler pursues a less aggressive strategy for optimizing away temporaries and branches from nested lifted conversions and unary operators because it suffers from “premature optimization”. That’s a loaded term and I’m not using it in the standard sense, so I want to clarify that a bit.

Donald Knuth, author of the classic four-volume series The Art of Computer Programming, famously said that “premature optimization is the root of all evil.” I think however that it is more instructive to read that quotation with more context:

We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified.

Which is of course what I have echoed in my numerous performance rants over the years: don’t waste your valuable time making risky changes to the 97% of the code that isn’t the slowest thing and that no customer will ever notice. Use a profiler, find the slowest 3%, and spend your optimization budget on that.

That is good advice, but when I say that a compiler suffers from “premature optimization”, that’s not at all what I mean. Rather, I mean that the compiler performs an optimization pass too early in the compilation process.

Back in 2010 I described in considerable detail the various stages, or “passes”, that the original recipe C# compiler performs when going from raw text to IL, so you might want to read that. For the purposes of this discussion we can simplify that all down to four stages:

  1. Lexical and grammatical analysis
  2. Initial “binding” — that is, semantic analysis
  3. “Lowering” — that is, rewriting high-level code into low-level code — and additional error detection
  4. Code generation

You would expect that semantic optimizations like the ones I described back in 2009, such as lifted arithmetic lowering, would happen in the third stage. Some optimizations of course happen during the fourth stage, because the code generator itself can identify branches and temporaries that can be eliminated. But most happen in the third stage.

Doing an optimization in the wrong stage can introduce correctness problems; for some examples of how premature optimizations in the initial binding pass led to bugs and breaking changes, see my posts from 2006 on that subject. Part one. Part two.

Today I’m not concerned about correctness; I’m concerned about how complete the optimization is. The implementation decision which is vexing me today is the original recipe C# compiler’s strategy: the initial binding pass identifies portions of lifted arithmetic expressions that can be optimized later and flags them as needing attention during the lowering pass, which is where the optimizations are done.

I am over-simplifying here; it is not as simple as a Boolean flag in most cases. In fact, the amount of information that is stored by the initial binding pass for the use of the optimizer later is quite scary because it is easy to accidentally use the wrong bits when lowering. An example of such a bug is in this StackOverflow question. But we can think of it logically as a flag.

The problem is that the initial binding pass only identifies opportunities for optimization based on the original form of the code. If the optimization pass produces “lowered” code that is itself amenable to further optimization then it is never optimized because there’s no flag left in there by the initial binding pass! Deciding whether something could benefit from optimization was being done too soon.

To make a long story short — and yes, this seems to have gotten rather long, sorry — the practical upshot is that the original recipe compiler is very good at finding “shallow” optimization opportunities on lifted operations, but very bad at making optimizations compose nicely when lifted operations are deeply nested; those tend to generate lots of unnecessary temporaries and branches.

Like I said previously, the compiler is not required to make those optimizations, but it has always vexed me that it does not. Roslyn improves on this situation by deferring all lowering and optimization of lifted arithmetic to the lowering phase; only the bare minimum analysis is performed during the initial binding pass. Roslyn optimizes each lifted arithmetic expression as it lowers it to temporaries and conditionals, and then tries aggressively to “distribute” lifted unary operations and conversions into those generated conditionals in order to skip creation of unnecessary temporaries and branches.


Next time on FAIC: Is it ever possible to apply this optimization technique to a lifted binary operator? I’ll pose that question in the form of a puzzle.


Eric is crazy busy at Coverity’s head office; this posting was pre-recorded.

Nullable micro-optimizations, part five

Last time in this series I described how a lifted implicit conversion could be “distributed” to both branches of the conditional operator that is the realization of a lifted arithmetic expression, and then optimized further on each side. Of course, the thing being converted can be any lifted expression, which means that the optimization “composes” nicely: it can be applied repeatedly when lifted operations are nested.

This is a bit of a silly illustrative example: suppose you have expressions x and y of type A? with a lifted addition operator that produces an A?. There’s also a lifted conversion from A? to B?, and similarly from B? to C?.

C? c = (C?)(B?)(x + y);

As we discussed previously in this series, the compiler realizes the lifted addition as a conditional expression. We know that the lifted conversion to B? can be “distributed” to the consequence and alternative branches of the conditional expression. That then results in a different conditional expression, but one such that the conversion to C? can be distributed to each branch of that! That is, the compiler could realize the code above as:

C? c;
A? temp1 = x;
A? temp2 = y;
c = temp1.HasValue & temp2.HasValue ? 
  new C?((C)(B)(temp1.GetValueOrDefault() + temp2.GetValueOrDefault())) :
  new C?();

… by applying the optimization twice, rather than creating a temporary of type A? for the sum and a temporary of type B? for the conversion of the sum, each with its own conditional expression. The aim of the optimization is to reduce the number of temporaries and conditional expressions, and thereby make the code smaller and produce fewer basic blocks.

A lifted conversion is rather like a lifted unary operator, and in fact the compiler could do the analogous optimization for the lifted unary +, -, ~ and ! operators. Continuing our silly example, suppose we have a lifted ~ operator on A? that produces an A?. If you said:

C? c = (C?)(B?)~(x + y);

Then the ~ operation can also be “distributed” to each branch of the conditional just as the conversions can be. The insight here is the same as before: if the consequence and alternative are both of the same type then

~(condition ? consequence : alternative)

is the same as

condition ? ~consequence : ~alternative

When we furthermore know that the consequence is of the form new A?(something) then we know that ~consequence is the same as new A?(~something). When we know that the alternative is of the form new A?(), then we know that ~new A?() is going to be a no-op, and just produce new A?() again. So, to make a long story short, the C# compiler can codegen the code above as:

C? c;
A? temp1 = x;
A? temp2 = y;
c = temp1.HasValue & temp2.HasValue ? 
  new C?((C)(B)(~(temp1.GetValueOrDefault() + temp2.GetValueOrDefault()))) :
  new C?();

Again, we save several temporaries and branches by performing this optimization.

Now, I’ve been saying “the compiler could” a lot because of course a compiler is not required to perform these optimizations, and in fact, the “original recipe” compiler is not very aggressive about performing these optimizations. I examined the original recipe compiler very closely when implementing nullable arithmetic in Roslyn, and discovered that it suffers from a case of “premature optimization”.


Next time on FAIC: We’ll digress for the next couple of posts. Then I’ll pick up this subject again with a discussion of the evils of “premature optimization” of nullable arithmetic, and how I’m using that loaded term in a subtly different way than Knuth did.