Life, part 6

Code for this episode can be found here. The only interesting change I’ve made to the client is that if you press “P”, it pauses the simulation and runs 5000 steps of the “acorn” pattern, and then dumps the number of milliseconds that took to a text file. This is a quick and dirty performance test, but as we’ll see, we can get great results even with this primitive tool.

Last time we demonstrated that a single tick in the naïve “array of bools” algorithm is O(n) in the number of cells, which is not surprising — after all, we do a neighbor count of every cell on every tick. This gives us the insight that if we quadrupled the number of cells, we’d probably quadruple the time to compute a step, which is good to know, but we still don’t know how bad it is, aside from noting that we’re having no problem keeping up with a cadence of two ticks per second of course!

As I mentioned in the preamble, I wrote the quickest little perf tester you could imagine and ran it on the release version of the app. There is no point in running performance testing on a debug app, or on a release app while it is running in the debugger; remember, the runtime knows it is being debugged and changes the performance characteristics of the program in order to make it easier to debug.

On my home machine — a decidedly mediocre mid-end PC game box — I got 5000 steps of “acorn” in 17 seconds, which is around 300 steps per second. By 1992 standards, that would be insanely great, but let me tell you, machines are a lot faster today than they were in 1992.

I did not deliberately “sandbag” this implementation — I did not write it to have a deliberately slow bit — so there’s no trick to figuring out where we’re spending time. I wrote it with an emphasis on clarity and correctness, knowing that we could come back later and find the hot spots; it’s a lot easier to optimize code if it is already clear and correct.

The question now is: where would you guess most of the time is being spent in the calculation? (Remember, we are omitting the time to draw to the bitmap.)

If you didn’t have a profiler on your machine — and I don’t! I have the free version of VS on my home machine — what would your gut tell you? Because my gut is almost always at least partially wrong when it comes to performance.

Let’s quickly review the naïve algorithm. The major parts are:

Test to see if a point is inside our 8-quad of valid cells:

private bool IsValidPoint(long x, long y) => 
  0 <= x && x < width && 0 <= y && y < height;

Fetch the value from the array if the point is valid; dead otherwise:

public bool this[long x, long y]
    if (IsValidPoint(x, y))
      return cells[x, y];
    return false;

Count the neighbors:

private int LivingNeighbors(int x, int y)
  int sum = this[x - 1, y - 1] ? 1 : 0;
  sum += this[x - 1, y] ? 1 : 0;
  sum += this[x - 1, y + 1] ? 1 : 0;
  sum += this[x, y - 1] ? 1 : 0;
  sum += this[x, y + 1] ? 1 : 0;
  sum += this[x + 1, y - 1] ? 1 : 0;
  sum += this[x + 1, y] ? 1 : 0;
  sum += this[x + 1, y + 1] ? 1 : 0;
  return sum;

And the main loop:

bool[,] newCells = new bool[width, height];
for (int y = 0; y < height; y += 1)
  for (int x = 0; x < width; x += 1)
    int count = LivingNeighbors(x, y);
    newCells[x, y] = count == 3 || (cells[x, y] && count == 2);
cells = newCells;

So, no peeking below: without a profiler to guide you, what would you assume was the low-hanging fruit that could be attacked here to make the whole thing a little faster without changing the basic algorithm or data structures?

I find that I’m dead wrong about a third of the time when I look at a piece of code and guess where the time is being spent, and fortunately this was one of the times I guessed correctly. The biggest cost by far in this loop is that it calls IsValidPoint eight times on every loop iteration. Since there are 65536 loop iterations per tick, that’s over two billion validity checks in our performance run.

Moreover, the vast majority of the time, the point is valid; the only “invalid” points we generate are when looking for neighbors when on edges.

How can we improve this? The easiest way is just to note that we have two cases, the interior points where all neighbors are valid, and the edges where not all neighbors are valid. If we are in the former case, which we can check for once, then we can skip all the other validity checks:

if (1 <= x && x < width - 1 && 1 <= y && y < height - 1)
  int sum = cells[x - 1, y - 1] ? 1 : 0;
  sum += cells[x - 1, y] ? 1 : 0;
  return sum;
  // as before:
  int sum = this[x - 1, y - 1] ? 1 : 0;

Making this change takes us from 5000 ticks in 17 seconds to 4.4 seconds.That’s a 4x improvement right there!

That was no surprise at all, but my experiments with making further tweaks to the code were quite surprising. Before I get into the details, let me hasten to say: I have not actually looked at the generated assembly to figure out why the jitted code has non-obvious performance characteristics, but I have some educated guesses. Take everything below with a grain of salt, and if anyone cares to look at the generated machine code to confirm or deny my guesses, I’d be happy to make an update.

Here are some things I tried to see if they made an improvement or a regression; I’ll list the winners first:

I wondered if we were paying any penalty for all these conditionals:

int sum = cells[x - 1, y - 1] ? 1 : 0;

On the one hand, the Boolean should be represented as a byte that is either zero or one, so the jitter could make the conditional a no-op, right?

On the other hand, there are sneaky ways to make a Boolean contain a non-zero, non-one value, and those should all be treated as true, but that would make the “turn this conditional into a no-op” optimization invalid.

Like I said above, I haven’t checked the codegen to see what exactly it does. But when I turned the bool array into a byte array and removed the conditionals, that trimmed an additional 40 microseconds off of each step for a total of 200 milliseconds over 5000 steps.

Allocating a lot of new arrays has two costs: it has to initialize the new array to zero, and it causes collection pressure which increases the number of GCs. (Fortunately we are sneaking in under the 85K limit for arrays to avoid going on the Large Object Heap, but we’re very close, and that would change things as well.)

This algorithm fills in every element in the new array, so there’s no need for it to be initialized at all, or allocated and then immediately freed.

If we go to a solution where we cache the temporary array and re-use it, that shaves off another 200 milliseconds over 5000 steps.

When I was working on the nullable lowering code in the compiler I learned that if we were generating, say, an addition of two nullable integer values x and y, that you always generate the code as if the user wrote:

result = x.HasValue() & y.HasValue() ? ...

and not

result = x.HasValue() && y.HasValue() ? ...

Why’s that? Because checking whether x has a value and branching if it does not is more expensive than simply calling y.HasValue()! Avoiding the work is more expensive than just doing it. And moreover, every branch is a new basic block, and basic blocks make the jitter work harder and therefore potentially skip more optimizations; the jitter has to be fast.

I therefore wondered if I could rewrite

  if (1 <= x && x < width - 1 && 1 <= y && y < height - 1)

which after all is just

  if (1 <= x)
    if (x < width - 1)
      if (1 <= y)
        if (y < height - 1)

as instead

  if (1 <= x & x < width - 1 & 1 <= y & y < height - 1)

Turns out, that’s actually a small regression. This was suprising, particularly since the conditions are almost always true and therefore you don’t get much savings by avoiding work on the edges.

I’m not sure what’s going on here, but I have a guess. Let’s look at the other losers and then I’ll theorize further at the end.

As we’ve seen, the special-purpose code for handling the edges is a source of pain because we do extra work to not dereference outside the array. A standard technique for Life on finite grids is to make a “rectangle of death” along the edge. That is, allocate memory for a 256×256 grid, but only actually do the computations on the 254×254 grid on the interior, and keep the 1000 or so cells on the edges permanently dead. We waste a couple KB of memory, but we save time on not having to constantly check whether we’re computing neighbors of an edge cell because we just skip them entirely.

When I only restricted the computations to the interior by changing the loop conditions, of course we were now doing 1000 fewer cells than before, so there was a commensurate speedup. But here’s the weird bit. When I removed the now unnecessary:

if (1 <= x && x < width - 1 && 1 <= y && y < height - 1)

check, the program got slightly but consistently slower.

What the heck? It’s doing less work, so how did it get slower?

This code seems to be doing some recomputation:

int sum = this[x - 1, y - 1] ? 1 : 0;   
sum += this[x - 1, y] ? 1 : 0;

We recompute x-1, x+1, y-1 and y+1 several times.

When I put those into local variables to avoid recomputing them, again I got a small but consistent regression!

Finally, the one that was most surprising to me. It is fairly well-known that in .NET, the jitter does a better job optimizing operations on single-dimensional arrays than multi-dimensional arrays. I tried changing my cell array to a single-dimensional array, and then accessed it as

cells[x + y * width]

throughout. This also caused a small regression!

I have a hypothesis that explains these unexpected regressions.

Arrays are memory-safe in C#. Accessing a single-dimensional array outside its bounds throws an exception, and accessing a multi-dimensional array outside of any of its bounds is an exception even if the offset computed into the underlying memory block would be in range.

The jitter has a lot of smarts in it to avoid doing unnecessary bounds checks. In particular, if you have:

for(i = 0; i < a.Length; a += 1)
  sum += a[i]

Then the runtime knows that the array access inside the loop will always be in bounds, so it skips emitting a bounds check.

I suspect that what is going on here with all of these unexpected regressions is that the edit causes the jitter to fall out of a “known to be good” path and it starts generating more array bounds checks.

We are smart enough to know that the array indices will be in bounds, but the jitter only has a very small time budget to analyze a method body; as I noted above, it makes no sense to do a work-avoiding optimization if determining whether the optimization applies costs more than the work saved!

This is just a conjecture though; if anyone knows for sure what is going on here, I’d love to see a deeper analysis.

The obvious comment here is “can we avoid the checks entirely by going to unsafe code?” I’ll address that point in a later episode, so hold off there for now!

Summing up:

  • Our tweaked algorithm, which you can find in BoolArrayOpt.cs, goes from 17 seconds to 4 seconds for 5000 ticks, or 1250 ticks per second. That seems decent for an early attempt.
  • If we get 1250 TPS with an 8-quad, then we should get about 80 TPS with a 10-quad and about 20 with an 11-quad, so a 1024×1024 grid is about as good as we can get with this implementation if we want to meet our performance goal of animating smoothly. I have not actually tried it though; it would be interesting to see what the scaling factor really is.
  • The moral of the story here is performance of managed code can be surprising, so always measure! 

Next time on FAIC: there’s a (seemingly) completely different flavor of array-based solution to this problem that I’d like to spend an episode or two digging into.

13 thoughts on “Life, part 6

  1. With your first optimization, with the block `if (1 <= x && x < width – 1 && 1 <= y && y < height – 1) {..} else {…}` it looks like you still do that check once for each cell.

    What I would try after this would be to split the single nested loops into parts, somewhat similar to the "rectangle of death" you mention.
    – 1 loop for x in 1 … width -1 and y in 1… height -1 (the interior cells): all neighbours known to be good, so no checking needed
    – 4 loops for the edges, excluding the 4 corners: which neighbours are good is fixed for each case, so you simply leave out accessing the non-existent ones as you don't need them for the sum
    – finally the 4 corners, here also it is known which neighbours are valid.

    Of course to be really fast you can use a blitter…

    • Indeed, I tried eliminating the one-per-cell check and it got worse; I’m not sure why.

      I am considering doing an episode on an algorithm that does as you say — dispatches to special routines for the edges, which, on a large grid, are a small fraction of the area. (Though interestingly, this is not the case in higher dimensions! In higher dimensional rectangles there is more surface area than interior.)

    • I played around with that idea a little in “Step”, but that only made it worse. It did however cause a speedup in “LivingNeighbors”, basically “unrolling” the different parts and then directly accessing “cells” without any more checking. Something like:

      if (x == 0) { 
        if (y == 0) { [access 3 neighbours] } 
        else if (y == height - 1) { [access 3 neighbours] } 
        else { [access 5 neighbours] } } 
      else if (x == width - 1) { 
        if (y == 0) { [access 3 neighbours] } 
        else if (y == height - 1) { [access 3 neighbours] } 
        else { [access 5 neighbours] } } 
      else { 
        if (y == 0) { [access 5 neighbours] } 
        else if (y == height - 1) { [access 5 neighbours] } 
        else { [access 8 neighbours] } }

      which turns “LivingNeighbors” into a bit of a “LivingHell”, but it seems to be an almost 10% speedup.

      Another ever so slight speedup in “Step”:

      (count == 2 && cells[x, y] != 0)

      instead of the other way ’round.

      Any attempts with a Set or a sparse matrix (but I probably did that wrong) went horribly slower. So I’m very curious, which direction this will take.

      • Excellent point! There’s an algorithm that I might discuss at some point that solves the same problem you describe with the terrible “if” statement by storing an extra bit in every cell that indicates whether it is on an edge or not — that bit never changes for a fixed-size grid obviously — and then builds a little dispatch engine that branches directly to a slower routine that handles the edge conditions; the majority of the time it stays on the fast path with no comparisons.

        However it is a rather tedious algorithm so I might skip it.

  2. I’ve seen gains from rewriting bounds checks of the form:
    1 <= x && x < width – 1
    (uint) (x-1) < width – 2

    Definitely less clear than the original, but I was able to measure a slight gain in my particular case, which was some encoder/decoder loop.

      • I tested it against your example and saw… “a small regression”.
        Making height and width const showed an improvement though, ~6%. Would the JIT loop unroll when the range is const?

        • “Make them constants” was on my list of experiments to try and then I totally forgot! Good catch.

          The moral of the story that I’m working towards here is that the truly great optimizations will arrive when we consider the specifics of the Life domain, rather than over-focusing on making a slow algorithm faster. Thus my intention here is not actually to spend a lot of time sussing out how to maximize the changes of getting the jitter to do more optimizations, so I’m not going to dig into the details. But if you (or anyone else) want to look at the code that the jitter generates, I’d be happy to add a link.

  3. Coming from an image processing background, I’d be inclined to:

    1. Employ an well-known, highly optimized and tested image processing library.
    2. Do a convolution with a kernel [[1, 1, 1], [1, 9, 1], [1, 1, 1]]. Highly optimized image processing libraries have this as a primitive. This gives you a number 0..8 for a formerly dead cell giving its number of live neighbors, and 9..17 for a formerly live cell. Some edge processing may be necessary.
    2a. Alternatively, the same result can be obtained by taking a simple 3×3 blur over the original matrix, and adding the original matrix multiplied by 8. This might be faster than the general convolution.
    3. Do a table lookup mapping 3, 10, and 11 to 1, all other values to 0.

    • I have only a vaguest understanding of image convolutions, but I was also assuming something like this was the way to go. Basically one needs to find some way to turn this into something that you can easily throw at the GPU. GPUs are all about trying to process your pixels all at once, right?

    • I’m going to discuss a simpler technique — in the sense that it requires less “concept count” — for treating this as a matrix operation in the next episode, and I’m going to discuss various “keep the neighbour counts along with the state in the cell” after that.

      After I get through that I think I’ll have enough background material to discuss your technique, but I am very far from an expert on image processing so I might skip it. Thanks for the note, it was very useful!

  4. Pingback: The Morning Brew - Chris Alcock » The Morning Brew #2987

  5. We used to do game of life as an interview question. I don’t like a lot of my old code, but I think this is quite nice. I really like the readability of the F# version and the tests:

    I assume you will be getting to a set-based solution? Storing the grid has always seemed wasteful to me, as you don’t care about the dead cells, so why have memory space for them?

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s