Code for this episode is here.

Today we can finish off our C# implementation of Stafford’s algorithm. Last time we turned the first pass into a table lookup; it might be a bit harder to optimize the second pass. Let’s take a look at the inner loop of the second pass; I’ll take out the comments and the debugging sanity checks:

Triplet t = triplets[x, y]; if (!t.Changed) continue; if (t.LeftNext) BecomeAlive(x * 3, y); else BecomeDead(x * 3, y); if (t.MiddleNext) BecomeAlive(x * 3 + 1, y); else BecomeDead(x * 3 + 1, y); if (t.RightNext) BecomeAlive(x * 3 + 2, y); else BecomeDead(x * 3 + 2, y); changes.Add((x, y));

The key to understanding why this is very suboptimal is to look at what the code actually does in a common case. Let’s suppose the “current” triplet state is “dead-alive-alive” and the “next” triplet state is “alive-alive-dead”. That is, the left cell is going to *become* alive, the middle cell is going to *stay* alive, and the right cell is going to *die*. What happens? Let’s go through it line by line.

if (!t.Changed)

Do some bit twiddling to discover that the current state and the next state are in fact different.

if (t.LeftNext)

Do some bit twiddling to discover that the next state is “alive”. Call BecomeAlive. What does it do?

if (t.LeftCurrent) return false;

Do some more bit twiddling to discover that yes, a change needs to be made.

triplets[tx - 1, y - 1] = triplets[tx - 1, y - 1].UUP(); triplets[tx, y - 1] = triplets[tx, y - 1].PPU(); triplets[tx - 1, y] = triplets[tx - 1, y].UUP(); triplets[tx, y] = t.SetLeftCurrent(true); triplets[tx - 1, y + 1] = triplets[tx - 1, y + 1].UUP(); triplets[tx, y + 1] = triplets[tx, y + 1].PPU();

Do some bit twiddling to set the current bit to alive; do five additions to five neighbouring triplets to update their living neighbour counts — in parallel, for the cases where there are two counts to update — and return to the inner loop:

if (t.MiddleNext)

Do some bit twiddling to discover that the middle cell “next” state is alive, and call BecomeAlive:

if (t.MiddleCurrent) return false;

Do more bit twiddling to discover that the middle cell is already alive, so there’s no work to do. Go back to the inner loop:

if (t.RightNext)

Do some bit twiddling to discover that the right cell “next” state is dead; call BecomeDead:

if (!t.RightCurrent) return false;

Do some bit twiddling to discover that it is currently alive and so neighbour counts need to be updated.

triplets[tx, y - 1] = triplets[tx, y - 1].UMM(); triplets[tx + 1, y - 1] = triplets[tx + 1, y - 1].MUU(); triplets[tx, y] = t.SetRightCurrent(false); triplets[tx + 1, y] = triplets[tx + 1, y].MUU(); triplets[tx, y + 1] = triplets[tx, y + 1].UMM(); triplets[tx + 1, y + 1] = triplets[tx + 1, y + 1].MUU();

Twiddle bits to set the current bit to dead; update the neighbour counts of five neighbouring triplets, *two of which we just updated when we did the left cell!*

Also, did you notice that when we did the left cell, we *added* one to the middle neighbour counts of the above and below triplets, and just now we *subtracted* one from that same count? We are doing work and then undoing it a few nanoseconds later.

The observations to make here are:

- On the good side, the idempotency check at the top of the algorithm is quite cheap. The cost of having duplicates on the “current changes” list is really low. But everything else is terrible.
- We are doing way too much bit twiddling still, and
*it is almost all in the service of control flow*. That is, most of those bit twiddles are all in “if” statements that choose a path through the algorithm to selectively run different operations. - If the left and right cells both changed then we need to update all eight neighbouring triplets, but we actually do
*ten*updates. - Worse, if the left, right
*and*middle cells changed, we need to update eight neighbouring triplets but we actually would do*twelve*updates! - We twiddle bits to set the current state up to three times, but we know what the final current state is going to be; it is the same as the “next” state.

How can we fix these problems?

The thing to notice here is **there are only 27 possible control flows through this algorithm**.They are:

- The idempotency check says that we already processed this change. Continue the loop.
- The left and middle cells are unchanged; the right cell is going from dead to alive.
- The left and middle cells are unchanged; the right cell is going from alive to dead.
- The left and right cells are unchanged; the middle cell is going from dead to alive.
- … and so on; you don’t need me to list all 27.

Let’s come up with a notation; I’ll say that “A” is “was dead, becomes alive”, “D” is “was alive, becomes dead”, and “U” is “no change”. The 27 possible control flows are therefore notated UUU, UUA, UUD, UAU, UAA, UAD, UDU, UDA, UDD, and so on, up to DDD. There are three positions each with three choices, so that’s three-to-the-three = 27 possibilities.

Since there are only 27 control flows, let’s just make a helper method for each! The control flow I just spelled out would be notated AUD, and so:

private bool AUD(int tx, int y) { triplets[tx - 1, y - 1] = triplets[tx - 1, y - 1].UUP(); triplets[tx, y - 1] = triplets[tx, y - 1].PUM(); triplets[tx + 1, y - 1] = triplets[tx + 1, y - 1].MUU(); triplets[tx - 1, y] = triplets[tx - 1, y].UUP(); triplets[tx + 1, y] = triplets[tx + 1, y].MUU(); triplets[tx - 1, y + 1] = triplets[tx - 1, y + 1].UUP(); triplets[tx, y + 1] = triplets[tx, y + 1].PUM(); triplets[tx + 1, y + 1] = triplets[tx + 1, y + 1].MUU(); return true; }

There are eight neighbouring triplets that need updating; we update all of them exactly once. We will see why it returns a bool in a minute.

You might be wondering what happens for something like “DDU” — left dies, middle dies, right unchanged. In that case there are neighbouring cells in other triplets that need to have their counts adjusted by two. We could write:

private bool DDU(int tx, int y) { triplets[tx - 1, y - 1] = triplets[tx - 1, y - 1].UUM(); triplets[tx, y - 1] = triplets[tx, y - 1].MMM().MMU(); ...

But we notice here that we could save an addition by again, making the compiler fold the constant math:

private bool DDU(int tx, int y) { triplets[tx - 1, y - 1] = triplets[tx - 1, y - 1].UUM(); triplets[tx, y - 1] = triplets[tx, y - 1].M2M2M(); ...

where we define a helper on the triplet struct:

public Triplet M2M2M() => new Triplet(-2 * lcountone - 2 * mcountone - rcountone + triplet);

We can implement all 27 methods easily enough and we need 12 new helper methods to do the additions. These methods are all tedious, so I won’t labour the point further except to say that method UUU — no change — is implemented like this:

public bool UUU(int tx, int y) => false;

Meaning “there was no change”. All the other methods return true.

So it is all well and good to say that we should make one method for every workflow; how do we dispatch those methods?

We observe that there are only 64 possible combinations of current state and next state, and each one corresponds to one of these 27 workflows, so let’s just make another lookup;** this one is a lookup of functions.**

We will treat the current state as the low three bits of a six-bit integer, and the next state as the high three bits; that gives us our 64 possibilities. If, say, the next state is alive-alive-dead (AAD) and the current state is dead-alive-alive (DAA) then the change is “become dead-unchanged-become alive” (DUA):

lookup2 = new Func<int, int, bool>[] { /* NXT CUR */ /* DDD DDD */ UUU, /* DDD DDA */ UUD, /* DDD DAD */ UDU, /* DDD DAA */ UDD, /* DDD ADD */ DUU, /* DDD ADA */ DUD, /* DDD AAD */ DDU, /* DDD AAA */ DDD, /* DDA DDD */ UUA, /* DDA DDA */ UUU, /* DDA DAD */ UDA, /* DDA DAA */ UDU, /* DDA ADD */ DUA, /* DDA ADA */ DUU, /* DDA AAD */ DDA, /* DDA AAA */ DDU, /* DAD DDD */ UAU, /* DAD DDA */ UAD, /* DAD DAD */ UUU, /* DAD DAA */ UUD, /* DAD ADD */ DAU, /* DAD ADA */ DAD, /* DAD AAD */ DUU, /* DAD AAA */ DUD, /* DAA DDD */ UAA, /* DAA DDA */ UAU, /* DAA DAD */ UUA, /* DAA DAA */ UUU, /* DAA ADD */ DAA, /* DAA ADA */ DAU, /* DAA AAD */ DUA, /* DAA AAA */ DUU, /* ADD DDD */ AUU, /* ADD DDA */ AUD, /* ADD DAD */ ADU, /* ADD DAA */ ADD, /* ADD ADD */ UUU, /* ADD ADA */ UUD, /* ADD AAD */ UDU, /* ADD AAA */ UDD, /* ADA DDD */ AUA, /* ADA DDA */ AUU, /* ADA DAD */ ADA, /* ADA DAA */ ADU, /* ADA ADD */ UUA, /* ADA ADA */ UUU, /* ADA AAD */ UDA, /* ADA AAA */ UDU, /* AAD DDD */ AAU, /* AAD DDA */ AAD, /* AAD DAD */ AUU, /* AAD DAA */ AUD, /* AAD ADD */ UAU, /* AAD ADA */ UAD, /* AAD AAD */ UUU, /* AAD AAA */ UUD, /* AAA DDD */ AAA, /* AAA DDA */ AAU, /* AAA DAD */ AUA, /* AAA DAA */ AUU, /* AAA ADD */ UAA, /* AAA ADA */ UAU, /* AAA AAD */ UUA, /* AAA AAA */ UUU };

We can extract the key to this lookup table from a triplet with a minimum of bit twiddling:

public int LookupKey2 => triplet >> 9;

And we need one final bit twiddle to set the current state to the next state:

public Triplet NextToCurrent() => new Triplet((triplet & ~currentm) | ((triplet & nextm) >> 3));

Put it all together and our second pass loop body is greatly simplified in its action, though rather harder to understand what it is doing:

int key2 = triplets[x, y].LookupKey2; Func<int, int, bool> helper = lookup2[key2]; bool changed = helper(x, y); if (changed) { triplets[x, y] = triplets[x, y].NextToCurrent(); changes.Add((x, y)); }

We determine what work needs to be done via table lookup; if the answer is “none”, the helper returns false and we do nothing. Otherwise the helper does two to eight additions, each updating one to three neighbour counts in parallel. Finally, if necessary we update the current state to the next state and make a note of the changed triplet for next time.

We’re down to a single control flow decision, and almost all the bit operations are simple additions, so we will likely have considerably improved performance.

It was a long road to get here. I sure hope it was worth it…

Algorithm time(ms) ticks size(quad) megacells/s Naïve (Optimized): 4000 5K 8 82 Abrash 550 5K 8 596 Abrash w/ changes 190 5K 8 1725 Abrash, O(c) 240 5K 8 1365 Stafford v1 340 5K 8 963 Stafford v2 210 5K 8 1560 Stafford final 180 5K 8 1820

A new record for this series, so… yay?

Also, it is still an O(change) algorithm, and it is significantly more space-efficient at three cells per two bytes, instead of one cell per byte, so that’s good; we can store larger grids in the same amount of memory if we want.

**But oh my goodness, that was a lot of work to get around a 25% speedup over the one-byte-per-cell O(change) algorithm.**

Reading Abrash’s original article from the 1990s describing Stafford’s algorithm, it is a little difficult to parse out exactly how fast it is, and how much faster that is than Abrash’s original algorithm. If I read it right, the numbers are approximately:

- 200×200 grid of cells
- 20MHz-33MHz 486
- Measure performance of the computation, not the display (as we are also doing)
- Roughly 20 ticks per second for Abrash’s O(cells) “maintain neighbour counts” algorithm.
- Roughly 200 ticks per second for Stafford’s O(changes) implementation.

Dave Stafford attained a 10x speedup by applying optimizations:

- Store the next state, current state, and neighbour counts in a triplet short
- Maintain a deduplicated change list of previously-changed triplets
- Table lookup to compute next state of possibly-changed triplet
- Branch to a routine that does no more work than necessary to update neighbour counts

Obviously machines are a lot faster 25 years later. Stafford’s algorithm running on a 20MHz 486 does 200 ticks per second, and the same algorithm written in C# running on a 2GHz i7 does over 25000 ticks per second even with a 50%+ larger grid size. But we should expect that a machine with 100x clock speed, better processor caches, and so on, should do on the order of 100x more work per second.

(Aside: though we are doing 65536 cells in a grid instead of their 40000, Abrash and Stafford’s implementations also implemented wrap-around behaviour for the edges, which I have not; that complicates their implementations and I avoided the penalty of writing code to deal with those complications. So this is not a purely apples-to-apples comparison in many ways.)

What is a little surprising here is: my implementation of the two algorithms in C# shows a 3x speedup between the two algorithms that were originally compared, not a 10x speedup as was original observed.

Figuring out why exactly would require rather more analysis of the original code than I’m willing to do, but I have a few conjectures.

First, we don’t know what the test case was. If it had a small number of changes per tick then certainly an O(changes) algorithm is potentially much better than an O(cells) implementation. Perhaps the test case was biased towards O(changes) algorithms.

Second, I have implemented the *spirit* of both algorithms but there are a *lot* of differences. My favourite, by way of example, is that Stafford’s original machine code implementation does not use a lookup array for the second-pass optimization, like I do here. The triplet value *itself* (with some bits masked out) is the address of the code that is invoked. Let me repeat that in case it was not clear: it’s machine code so you have total authority to decide where in memory the helper functions are, and they are placed in memory such that *the triplet itself is a pointer to the helper function*. There is no table indirection penalty because there is no table.

These are the sorts of tricks that machine language programmers back in the 1990s lived for, and though I can admire it, that’s just not how we do stuff in high level languages. We could spend a bunch of time tweaking the C# versions of these algorithms to use pointers instead of offsets, or arrays with high water marks instead of lists, and so on. We’d get some wins, but they’d be marginal, and I’m not going to bother tuning this further. (As always, if anyone wants to, please make the attempt and leave a comment!)

The third conjecture is: yes, the rising tide of a 100x processor speed improvement lifted all boats, but it also changed the landscape of performance optimization because not all code is gated on processor speed. RAM latency, amount of available RAM, cache sizes, and so on, have also changed in that time but not at the same rate, and not to the same effect. Figuring out whether any of those changes had the effect of bringing these two algorithms’ performance closer together in the last 25 years is really hard to say.

**Next time on FAIC:** I want to do some experiments to look more directly at the scalability of these algorithms; do costs actually grow as O(change)? What happens if we make some larger grids where lots of stuff is changing?

**Coming up after that:** Every algorithm we’ve looked at so far uses some sort of two-dimensional array as the backing store, which limits us to grids that are, in the grand scheme of things, pretty small. Can we come up with storage solutions that are less constrained, and at what performance cost?

For today’s extra content: a *puffer* whose debris is *spaceships* is called a *rake*; the first rake discovered back in the 1970s is called *spacerake*. (Image from the wiki.)

Again we see that you can do interesting things by sticking a bunch of c/2 spaceships together. Here’s a wider view snapshot that maybe makes it a little more clear how it comes to resemble “raking” the plane:

And again, this is an example of a pattern which causes the number of changes per tick to grow linearly over time.

“We could spend a bunch of time tweaking the C# versions of these algorithms to use pointers instead of offsets…”

I did that for your previous StaffordTwo implementation and saw a pretty significant (~20%) reduction just from moving to pointers.

These were the results of that perf run:

AbrashChangeList:145

AbrashOneArray:170

StaffordOne:225

StaffordTwo:152

StaffordTwoPointers:119

Minor typo:

new Func[1 >> 6] // shifting wrong direction

And now I wonder if this typo is due to having to html-encode the angle brackets. Because the comment software ate some of mine.

No, it was a genuine typo; thanks for the note and for doing the perf tests!

20% marginal gain for unmanaged code is a little on the high end of what I would expect, but certainly reasonable.

I suspect that a fair amount of cost is also due to my choosing a two-d array, as the CLR is not heavily optimized for that case.