Life, part 26

Last time on FAIC we saw how we could start with nine 2-quads — a 12×12 grid of cells — and with only a handful of ands, ors and table lookups we could get an 8×8 grid of cells of the next state out. How is that useful?

In order to explain, we need two more data structures, and it should come as no surprise that they are Quad3 and Quad4. Quad3, as I mentioned last time, is trivial; it is just a struct wrapping four Quad2s. The implementation is simpler than Quad2:

struct Quad3
  public Quad3(Quad2 nw, Quad2 ne, Quad2 sw, Quad2 se)
    NW = nw;
    NE = ne;
    SW = sw;
    SE = se;
  public Quad2 NW { get; }
  public Quad2 NE { get; }
  public Quad2 SW { get; }
  public Quad2 SE { get; }
  public bool Get(int x, int y) { ... }
  public Quad3 Set(int x, int y) { ... }
  public Quad3 Clear(int x, int y) { ... }
  public override string ToString() { ... }

We have yet another immutable struct. It’s small — only 64 bits. The Get, Set, Clear and ToString implementations are the obvious and straightforward extensions of the similar methods I already showed for Quad2, so I won’t bother showing them here. In a few episodes we will need to add some helper methods to Quad3, but for now, this will do us just fine.

The interesting data structure is Quad4, which by rights should represent a 16×16 grid of cells. The Quad4 has some surprises:

  • A Quad4 is a mutable class, not an immutable struct.
  • A Quad4 is not four Quad3s. It is eight! A Quad4 represents two ticks worth of a 16×16 portion of the grid. We have four Quad3s for the even-numbered generations, and four for the odds.
  • Each Quad4 is numbered with a unique (x, y) pair of signed short integers. That gives us a square of Quad4s 65536 on a side; since each Quad4 is 16 cells wide and tall, we have a total grid size of just over one million cells on a side; that’s a 20-quad. That’s pretty big. Of course, we will typically allocate nowhere even close to that many Quad4s.
  • Each Quad4 has a reference to six neighbouring Quad4s; specifically, it knows what Quad4s are to its north, south, west, east, northwest and southeast. A null reference is to be treated as though those cells are all dead. (Why not northeast and southwest? We shall see.)

There are more surprises to come later. Let’s write some code. It’s pretty straightforward:

sealed class Quad4
  public Quad4(int x, int y)
    X = (short)x;
    Y = (short)y;

  // A unique pair of shorts for each 4-quad in a 20-quad.
  public short X { get; }
  public short Y { get; }

  // References to neighbouring Quad4s:
  public Quad4 N { get; set; }
  public Quad4 S { get; set; }
  public Quad4 E { get; set; }
  public Quad4 W { get; set; }
  public Quad4 NW { get; set; }
  public Quad4 SE { get; set; }

  // Cells
  private Quad3 evenNW;
  private Quad3 evenSW;
  private Quad3 evenNE;
  private Quad3 evenSE;
  private Quad3 oddNW;
  private Quad3 oddSW;
  private Quad3 oddNE;
  private Quad3 oddSE;

  public bool GetEven(int x, int y) { ... }
  public void SetEven(int x, int y) { ... }
  public void ClearEven(int x, int y) { ... }

  public bool GetOdd(int x, int y) { ... }
  public void SetOdd(int x, int y) { ... }
  public void ClearOdd(int x, int y) { ... }

  public override string ToString() { ... }

Once again those last seven methods are (for now!) just trivial rewrites of the equivalent methods on a Quad2, so I’ll omit them here.

Obviously some controlling code representing the entire grid needs to create some number of Quad4s, assign numbers to them, set up the references, and so on. We’ll write that code in a later episode; right now I want to continue concentrating on the key problem of stepping to the next tick.

Suppose we are currently on an even generation and we want to step all the Quad4s in the system to the next generation, which will of course be odd. We add this method to Quad4 and have the controlling code call it for every Quad4:

public void StepEven()
  StepEvenNW();
  StepEvenSW();
  StepEvenNE();
  StepEvenSE();

Pretty straightforward; we step every even Quad3 forward one tick, and presumably that fills in the odd Quad3s. How are we going to do that though?

Time for more diagrams.

Again I’ve just randomly chosen a fragment of “acorn”, this time from generation 416. Let’s suppose the 4-quad I’ve marked as “this” is the one we are attempting to step forward one generation. (Click on any image for a higher-res version if one is available.)

Let’s zoom in a bit on “this” and give expressions that refer to some interesting Quad2s:

Well would you look at what we have here: nine Quad2s. If we step those forward one step with our handy super-efficient method from last episode, we get the next state of the green Quad3:

The code could not be easier:

private void StepEvenNW()
  oddNW = Step9Quad2ToQuad3(
    evenNW.NW, evenNW.NE, evenNE.NW,
    evenNW.SW, evenNW.SE, evenNE.SW,
    evenSW.NW, evenSW.NE, evenSE.NW);

I’m going to keep this in its own method though, as I’ll be adding more gear to it in an upcoming episode.

That’s the easy one, and I don’t think I need to labour the point with more diagrams. In order to step forward the other Quad3s we need to be careful about the fact that this.E, this.SE and this.S could be null, but if they are, we just substitute in an all-dead Quad2, which we helpfully made a static field for last time:

private void StepEvenSW()
  oddSW = Step9Quad2ToQuad3(
    evenSW.NW, evenSW.NE, evenSE.NW,
    evenSW.SW, evenSW.SE, evenSE.SW,
    S == null ? AllDead : S.evenNW.NW,
    S == null ? AllDead : S.evenNW.NE,
    S == null ? AllDead : S.evenNE.NW);
private void StepEvenNE()
  oddNE = Step9Quad2ToQuad3(
    evenNE.NW, evenNE.NE, E == null ? AllDead : E.evenNW.NW,
    evenNE.SW, evenNE.SE, E == null ? AllDead : E.evenNW.SW,
    evenSE.NW, evenSE.NE, E == null ? AllDead : E.evenSW.NW);
private void StepEvenSE()
  oddSE = Step9Quad2ToQuad3(
    evenSE.NW, evenSE.NE, E == null ? AllDead : E.evenSW.NW,
    evenSE.SW, evenSE.SE, E == null ? AllDead : E.evenSW.SW,
    S == null ? AllDead : S.evenNE.NW,
    S == null ? AllDead : S.evenNE.NE,
    SE == null ? AllDead : SE.evenNW.NW);

And when StepEven returns, we have the next state for the odd generation:

Wait a minute. We have committed the classic programmer blunder of being consistently off by one! We haven’t computed the next generation of the even 4-quad; we’ve computed the next generation of the 4-quad that is the even 4-quad shifted one cell to the southeast.

Well golly.

What are we going to do about this?

Give that some thought and then scroll down.








In my youth I was in a group of friends that watched terrible 80s horror movies every week, and we noticed patterns. In particular, in Killer Mutant Critter movies and zombie movies there is often a sheriff, and there are often concerned townsfolk, and the dialog seems to go

“Sheriff! Giant mutant toads are attacking the village! What are you going to do about it?!”

“Nuthin’.”

That’s what we’re going to do about it. Nuthin’.

This is our final surprise for this episode. A Quad4 represents a 16×16 section of the grid, for two generations, where the odd numbered generation is offset by one cell to the southeast from the even generation. Though this will make life very slightly difficult for the display code, this “staggered generations” property turns out to be surprisingly useful; we will see why in an upcoming episode.

Next time on FAIC: The next problem we need to solve is: how are we going to do the odd generation steps? We can’t just do the same thing because then we’ll be two cells off. Somehow we need to have the odd step get back to the northwest.

Life, part 25

Let’s crisp up the problem from last episode. The problem for today’s episode is to write a method that takes these nine 2-quads:

Computes these sixteen 1-quads:

And returns these four 2-quads:

Those four 2-quads form a 3-quad; I haven’t written a 3-quad data structure yet, but I’m sure you can imagine how it goes. We’ll write it out later; it’s just a struct containing four 2-quads. I like to start by writing the signature of the method:

static Quad3 Step9Quad2ToQuad3(
  Quad2 nw, Quad2 n, Quad2 ne,
  Quad2 w, Quad2 c, Quad2 e,
  Quad2 sw, Quad2 s, Quad2 se)

The first step is to get those twelve additional 2-quads. The first four we need are these:

Start at the beginning: how are we going to get the western 2-quad on the northern edge? Plainly it is made up of parts of the 2-quads I’ve labeled NW and N. A couple episodes ago I added an “or” operator to 2-quads and masks that get the eastern and western edges, so how about this?

Quad2 n_w = nw.EastEdge | n.WestEdge;

Do you see why that’s not quite right?  That is this 2-quad:

We’ve pasted the east edge of nw and the west edge of n together, but we also need to swap eastness with westness. The result we’ve got is the 2-quad that we want, but mirror-imaged at the level of 1-quads. Make sure that this step is clear in your head, because correctly understanding how this is “mirrored” is important in this algorithm.

We’re going to be doing these operations a lot in this episode, so in keeping with my general attitude of “there’s nothing too small to go in a method”, I’m going to make two helper methods on Quad2 right now; one creates a problem and one fixes it:

public Quad2 HorizontalMiddleMirrored(Quad2 right) =>
  EastEdge | right.WestEdge;
public Quad2 Mirror =>
  new Quad2((ushort)((cells << 8) | (cells >> 8)));
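If you want to convince yourself that the shift trick in Mirror really does swap eastness with westness at the level of 1-quads, the arithmetic is easy to check with a few lines of throwaway script. Here is a sketch in Python — not part of the C# implementation, just the masks and shifts from the helpers above transcribed so we can assert what they do:

```python
EAST_EDGE_MASK = 0x00FF   # NE | SE, as in Quad2
WEST_EDGE_MASK = 0xFF00   # NW | SW

def horizontal_middle_mirrored(left, right):
    # EastEdge of the left quad pasted onto WestEdge of the right quad
    return (left & EAST_EDGE_MASK) | (right & WEST_EDGE_MASK)

def mirror(cells):
    # swap the east and west 1-quads: NW<->NE and SW<->SE
    return ((cells << 8) | (cells >> 8)) & 0xFFFF

# Mirroring twice gets us back where we started...
assert mirror(mirror(0x1234)) == 0x1234
# ...and pasting-then-mirroring puts nw's NE 1-quad into the NW
# position of the result, which is exactly where n_w needs it.
nw, n = 0x00F0, 0x0000    # only the NE 1-quad of nw is alive
assert mirror(horizontal_middle_mirrored(nw, n)) == 0xF000
```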

Now we can write correct code for extracting the n_w 2-quad:

Quad2 n_w = nw.HorizontalMiddleMirrored(n).Mirror;

Why not combine these into one helper function HorizontalMiddle? Normally I would, but… we will come back to this point in a bit.

So far we’ve done two ands, two ors and two shifts. We can do three times that work to similarly get:

Quad2 n_e = n.HorizontalMiddleMirrored(ne).Mirror;
Quad2 c_w = w.HorizontalMiddleMirrored(c).Mirror;
Quad2 c_e = c.HorizontalMiddleMirrored(e).Mirror;

We now have eight of the sixteen 2-quads that we need. I’m sure you see how we can get these four:

We need to do the same thing again but this time for the vertical direction. And this time the 2-quads are going to come out flipped north-for-south instead of mirrored east-for-west. Two more helper methods!

public Quad2 VerticalMiddleFlipped(Quad2 bottom) =>
  SouthEdge | bottom.NorthEdge;
public Quad2 Flip => 
  new Quad2((ushort)(((cells & SouthEdgeMask) << 4) | 
    ((cells & NorthEdgeMask) >> 4)));
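As with Mirror, the Flip trick can be sanity-checked with a throwaway script; this Python sketch just transcribes the masks and shifts from the helper above:

```python
NORTH_EDGE_MASK = 0xF0F0   # NW | NE, as in Quad2
SOUTH_EDGE_MASK = 0x0F0F   # SW | SE

def flip(cells):
    # swap the north and south 1-quads: NW<->SW and NE<->SE
    return (((cells & SOUTH_EDGE_MASK) << 4) |
            ((cells & NORTH_EDGE_MASK) >> 4)) & 0xFFFF

assert flip(flip(0xBEEF)) == 0xBEEF    # flip is its own inverse
assert flip(0xF000) == 0x0F00          # NW 1-quad moves to SW
assert flip(0x000F) == 0x00F0          # SE 1-quad moves to NE
```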

and now we can compute those four:

Quad2 w_n = nw.VerticalMiddleFlipped(w).Flip;
Quad2 w_s = w.VerticalMiddleFlipped(sw).Flip;
Quad2 c_n = n.VerticalMiddleFlipped(c).Flip;
Quad2 c_s = c.VerticalMiddleFlipped(s).Flip;

We’re 75% of the way to getting our sixteen needed 2-quads. How are we going to get these?

Three of them are pretty easy; they can be constructed from what we already have in hand.

Quad2 c_nw = w_n.HorizontalMiddleMirrored(c_n).Mirror;
Quad2 c_sw = w_s.HorizontalMiddleMirrored(c_s).Mirror;
Quad2 c_ne = n_e.VerticalMiddleFlipped(c_e).Flip;

That last one is a little tricky but we can get it:

Quad2 s_e = s.HorizontalMiddleMirrored(se).Mirror;
Quad2 c_se = c_e.VerticalMiddleFlipped(s_e).Flip;

So far we’ve done 26 each of and, or, shift, to get the 16 2-quads we need.

We supposed that we had a fast lookup table that took a 2-quad and returned a 1-quad; let’s actually build that. The 2-quad key is a ushort and the returned 1-quad is an integer where the low four bits are the bits of the 1-quad:

static int[] lookup = new int[65536];
... initialized as ...
for (int i = 0; i <= 0xffff; i += 1)
  lookup[i] = StepQuad2(new Quad2((ushort)i));

We can therefore solve our problem with this mess:

Quad2 newNW = new Quad2((ushort)(
  (lookup[(ushort)nw] << 12) | // NW is bits 12-15
  (lookup[(ushort)w_n] << 8) | // SW is 8-11
  (lookup[(ushort)n_w] << 4) | // NE is 4-7
  (lookup[(ushort)c_nw])));    // SE is 0-3
Quad2 newNE = new Quad2((ushort)(
  (lookup[(ushort)n] << 12) |
  (lookup[(ushort)c_n] << 8) |
  (lookup[(ushort)n_e] << 4) |
  (lookup[(ushort)c_ne])));
Quad2 newSW = new Quad2((ushort)(
  (lookup[(ushort)w] << 12) |
  (lookup[(ushort)w_s] << 8) |
  (lookup[(ushort)c_w] << 4) |
  (lookup[(ushort)c_sw])));
Quad2 newSE = new Quad2((ushort)(
  (lookup[(ushort)c] << 12) |
  (lookup[(ushort)c_s] << 8) |
  (lookup[(ushort)c_e] << 4) |
  (lookup[(ushort)c_se])));
return new Quad3(newNW, newNE, newSW, newSE);

What I like about this: it would work.

What I dislike about this: we’re back in the mechanism domain. We are treating 1-quads as ints and constantly going back and forth between ushorts and Quad2s. The implementation detail of how a Quad2 is laid out in memory is once again front and center in this algorithm. Also, it feels like we are doing more work than we need to throughout this algorithm.

And we are wasting a bunch of space in the lookup table. We have 65536 32-bit integers but we are only using 4 of those bits. That seems like a small problem; memory is cheap after all but… wait a minute… what if we could fix several of these problems at once?

The first step in fixing this mess is: we’re going to make the lookup table an array of Quad2s, where each Quad2 contains four copies of the 1-quad:

static Quad2[] lookup = new Quad2[65536];
... initialized as ...
for (int i = 0; i <= 0xffff; i += 1)
  int n = StepQuad2(new Quad2((ushort)i));
  lookup[i] = new Quad2((ushort)(
    (n << 12) | (n << 8) | (n << 4) | n));
static Quad2 Lookup(Quad2 q) => lookup[(ushort)q];

I mean, why not? We have plenty of bits to spare, so use them.
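To see why the four copies help, note that masking the replicated value with any quadrant mask leaves the stepped 1-quad already sitting in that quadrant’s position — no shift needed at the call site. A throwaway check of just the bit arithmetic (a Python sketch, not part of the C# code):

```python
def quadruple(n):
    # four copies of a 1-quad packed into one 16-bit value
    return ((n << 12) | (n << 8) | (n << 4) | n) & 0xFFFF

# Masking the replicated value with a quadrant mask leaves the
# 1-quad already in that quadrant's position.
assert quadruple(0xA) & 0xF000 == 0xA000   # NW
assert quadruple(0xA) & 0x0F00 == 0x0A00   # SW
assert quadruple(0xA) & 0x00F0 == 0x00A0   # NE
assert quadruple(0xA) & 0x000F == 0x000A   # SE
```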

Now the call site looks like this:

Quad2 newNW =  Lookup(nw).NW | Lookup(w_n).SW | Lookup(n_w).NE | Lookup(c_nw).SE;
Quad2 newNE =  Lookup(n).NW  | Lookup(c_n).SW | Lookup(n_e).NE | Lookup(c_ne).SE;
Quad2 newSW =  Lookup(w).NW  | Lookup(w_s).SW | Lookup(c_w).NE | Lookup(c_sw).SE;
Quad2 newSE =  Lookup(c).NW  | Lookup(c_s).SW | Lookup(c_e).NE | Lookup(c_se).SE;
return new Quad3(newNW, newNE, newSW, newSE);

Look at that improvement in the readability of the code! We plainly see that we are constructing new Quad2s out of the parts of old ones. We’re keeping all the typing consistent. This is great.

Could it be even better?

Take a look at those sixteen computations of a 1-quad on the next tick. You notice something?

  • All the ones that are going in the NW of a new 2-quad are lookups on an argument that was passed in to this method: nw, n, w and c.
  • All the ones that are going to the SW were computed by “flipping” something: w_n, c_n, w_s and c_s.
  • All the ones that are going to the NE were computed by “mirroring” something: n_w, n_e, c_w and c_e.
  • All the ones going to the SE were both flipped and mirrored: c_nw, c_ne, c_sw, c_se.

What does this mean?

It means we can move the flipping and mirroring operations back in time into the lookup table as well. Read this next bit carefully because it is tricky. We’re going to initialize the lookup table like this:

for (int i = 0; i <= 0xffff; i += 1)
  Quad2 normal = new Quad2((ushort)i);
  Quad2 mirror = normal.Mirror;
  Quad2 flip = normal.Flip;
  Quad2 both = mirror.Flip;
  int result = StepQuad2(normal);
  lookup[(ushort)both] |= new Quad2((ushort)result);           // SE
  lookup[(ushort)mirror] |= new Quad2((ushort)(result << 4));  // NE
  lookup[(ushort)flip] |= new Quad2((ushort)(result << 8));    // SW
  lookup[(ushort)normal] |= new Quad2((ushort)(result << 12)); // NW
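It is worth convincing yourself this initialization is right, and conveniently the identity holds for any function from 2-quads to 1-quads, not just the Life step. Here is a throwaway verification in Python; `fake_step` is an arbitrary stand-in for the real StepQuad2, since the trick depends only on mirror and flip being self-inverse bijections:

```python
def mirror(c): return ((c << 8) | (c >> 8)) & 0xFFFF
def flip(c):   return (((c & 0x0F0F) << 4) | ((c & 0xF0F0) >> 4)) & 0xFFFF

def fake_step(q):
    # arbitrary stand-in mapping a 16-bit 2-quad to a 4-bit 1-quad
    return (q ^ (q >> 7) ^ (q >> 13)) & 0xF

lookup = [0] * 65536
for i in range(65536):
    result = fake_step(i)
    lookup[flip(mirror(i))] |= result          # SE slot
    lookup[mirror(i)]       |= result << 4     # NE slot
    lookup[flip(i)]         |= result << 8     # SW slot
    lookup[i]               |= result << 12    # NW slot

# Because mirror and flip are self-inverse, looking up a quad that is
# still mirrored (or flipped, or both) yields the stepped result of the
# *original* quad, already sitting in the right slot, with no shifting:
for q in (0x0000, 0x1234, 0xBEEF, 0xFFFF):
    assert (lookup[q] >> 12) & 0xF == fake_step(q)           # NW
    assert (lookup[mirror(q)] >> 4) & 0xF == fake_step(q)    # NE
    assert (lookup[flip(q)] >> 8) & 0xF == fake_step(q)      # SW
    assert lookup[flip(mirror(q))] & 0xF == fake_step(q)     # SE
```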

And now we can remove all the flips and mirrors when computing the new Quad2s! Put it all together:

static Quad3 Step9Quad2ToQuad3(
  Quad2 nw, Quad2 n, Quad2 ne,
  Quad2 w, Quad2 c, Quad2 e,
  Quad2 sw, Quad2 s, Quad2 se)
  // These are mirrored
  Quad2 n_w = nw.HorizontalMiddleMirrored(n);
  Quad2 n_e = n.HorizontalMiddleMirrored(ne);
  Quad2 c_w = w.HorizontalMiddleMirrored(c);
  Quad2 c_e = c.HorizontalMiddleMirrored(e);
  // These are flipped
  Quad2 w_n = nw.VerticalMiddleFlipped(w);
  Quad2 c_n = n.VerticalMiddleFlipped(c);
  Quad2 w_s = w.VerticalMiddleFlipped(sw);
  Quad2 c_s = c.VerticalMiddleFlipped(s);
  // These are mirrored and flipped
  Quad2 c_nw = w_n.HorizontalMiddleMirrored(c_n);
  Quad2 c_ne = n_e.VerticalMiddleFlipped(c_e);
  Quad2 c_sw = w_s.HorizontalMiddleMirrored(c_s);
  Quad2 c_se = c_e.SouthEdge | c_s.NE | se.NW;
  Quad2 newNW =  Lookup(nw).NW | Lookup(w_n).SW | Lookup(n_w).NE | Lookup(c_nw).SE;
  Quad2 newNE =  Lookup(n).NW  | Lookup(c_n).SW | Lookup(n_e).NE | Lookup(c_ne).SE;
  Quad2 newSW =  Lookup(w).NW  | Lookup(w_s).SW | Lookup(c_w).NE | Lookup(c_sw).SE;
  Quad2 newSE =  Lookup(c).NW  | Lookup(c_s).SW | Lookup(c_e).NE | Lookup(c_se).SE;
  return new Quad3(newNW, newNE, newSW, newSE);

Computing c_se actually gets easier, which is a nice bonus.

What’s our final tally? To move ahead an 8×8 portion of a 12×12 grid we do 41 ands, 25 ors and 16 array lookups. And zero shifts. All the other bit operations have been moved backwards in time to run during the computation of the lookup table.

Could we do even less work?

Yes; we could eliminate 16 of those ands if we had four lookup tables each with four bits in the right place and zeros everywhere else; we then would not have to do any masking. It would just be:

  Quad2 newNW =  LookupNW(nw) | LookupSW(w_n) | LookupNE(n_w) | LookupSE(c_nw);

But that’s not how the original implementation did it, and so that’s not what I’m doing here. It would be an interesting experiment to find out if that makes a difference; maybe we will come back to that point later in the series, but probably not.

Oh, and incidentally, you remember that fragment of bit-twiddling code that I posted from the original implementation?

int ix01 = (ix00 & 0xf0f0) | (cN.q[14] & 0x0f0f);
int ix11 = (ix10 & 0xf0f0) | (cN.q[15] & 0x0f0f);
int ix03 = (ix02 & 0xf0f0) | (ix01 & 0x0f00) | (cN.q[7] & 0x000f);

I bet you can figure out what it does now!

Well that was quite the journey to get a fast method that can step an 8×8 fragment of a 12×12 grid forwards one tick by memoizing the “4×4 step to 2×2” problem. Why is this method useful?

Next time on FAIC: Yeah, why?

Life, part 24

When last we left off we had representations for 0-quads — a single bit — 1-quads — four bits in a row — and 2-quads — a wrapper data structure around an unsigned short that could extract single cells or mask out any of the component 1-quads “in place”. The question in the cliffhanger was: how can we move a 2-quad — a 4×4 grid of cells — forwards one generation?  Should we just pretend that there is a rectangle of empty cells around it?

No. Instead, we redefine what it means to step one tick forward. Stepping a 2-quad forward one tick results in a 1-quad; the center one-quad stepped one tick forwards.

Let me make sure that’s clear. What I’m saying is that if we have this 2-quad, just to pick one at random:

And we step it forwards one tick, we assume that this 2-quad is embedded in a larger grid and we do not know the true neighbour counts of the twelve cells on the edge. But we do have enough information to step the center four cells forwards, and that gives us a 1-quad:

The implementation is straightforward; we’ve seen it a bunch of times. Since I don’t have a custom-built data structure for a 0-quad or a 1-quad and they are just one bit or four bits, let’s just put them in an integer and we’ll figure out what to do with them later. (Recall that I said that bits 3, 2, 1, and 0 were the northwest, northeast, southwest and southeast cells of a 1-quad.)

private static int Rule(int cell, int count)
  if (count == 2)
    return cell;
  if (count == 3)
    return 1;
  return 0;
private static int StepQuad2(Quad2 q)
  int b00 = q.Get(0, 0) ? 1 : 0;
  int b01 = q.Get(0, 1) ? 1 : 0;
  int b02 = q.Get(0, 2) ? 1 : 0;
  int b03 = q.Get(0, 3) ? 1 : 0;
  int b10 = q.Get(1, 0) ? 1 : 0;
  int b11 = q.Get(1, 1) ? 1 : 0;
  int b12 = q.Get(1, 2) ? 1 : 0;
  int b13 = q.Get(1, 3) ? 1 : 0;
  int b20 = q.Get(2, 0) ? 1 : 0;
  int b21 = q.Get(2, 1) ? 1 : 0;
  int b22 = q.Get(2, 2) ? 1 : 0;
  int b23 = q.Get(2, 3) ? 1 : 0;
  int b30 = q.Get(3, 0) ? 1 : 0;
  int b31 = q.Get(3, 1) ? 1 : 0;
  int b32 = q.Get(3, 2) ? 1 : 0;
  int b33 = q.Get(3, 3) ? 1 : 0;
  int n11 = b00 + b01 + b02 + b10 + b12 + b20 + b21 + b22;
  int n12 = b01 + b02 + b03 + b11 + b13 + b21 + b22 + b23;
  int n21 = b10 + b11 + b12 + b20 + b22 + b30 + b31 + b32;
  int n22 = b11 + b12 + b13 + b21 + b23 + b31 + b32 + b33;
  int r11 = Rule(b11, n11);
  int r12 = Rule(b12, n12);
  int r21 = Rule(b21, n21);
  int r22 = Rule(b22, n22);
  return (r21 << 0) | (r11 << 1) | (r22 << 2) | (r12 << 3);

Now I know what you’re thinking; we’re doing sixteen array lookups for the masks, then sixteen bit operations, then umpteen dozen additions, and then a bunch of bit shifts and bit ors, and this all seems very expensive. But remember the lesson of our first-pass optimization to Stafford’s algorithm: we can do all 65536 computations for all possible 2-quads once before we start and store the resulting 1-quads in a lookup table! The cost then becomes a single lookup when it comes time to actually run a “step forwards” operation.
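If you would like to check the bit plumbing before we bury it in a lookup table, the whole thing ports easily to a throwaway script that we can test against patterns whose behaviour we already know. Here is a Python sketch; `make` and the neighbour-count helper are test scaffolding, not part of the C# code, but the masks table and the result bit layout are exactly the ones above:

```python
# bit masks for cell (x, y) of a 2-quad, same table as in Quad2
masks = [1 << 0x9, 1 << 0x8, 1 << 0x1, 1 << 0x0,
         1 << 0xb, 1 << 0xa, 1 << 0x3, 1 << 0x2,
         1 << 0xd, 1 << 0xc, 1 << 0x5, 1 << 0x4,
         1 << 0xf, 1 << 0xe, 1 << 0x7, 1 << 0x6]

def get(q, x, y):
    return 1 if q & masks[x + y * 4] else 0

def make(cells):
    # test helper: build a 2-quad from a list of live (x, y) coordinates
    q = 0
    for (x, y) in cells:
        q |= masks[x + y * 4]
    return q

def rule(cell, count):
    if count == 2: return cell
    if count == 3: return 1
    return 0

def step_quad2(q):
    def count(x, y):
        return sum(get(q, x + dx, y + dy)
                   for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                   if (dx, dy) != (0, 0))
    r11 = rule(get(q, 1, 1), count(1, 1))
    r12 = rule(get(q, 1, 2), count(1, 2))
    r21 = rule(get(q, 2, 1), count(2, 1))
    r22 = rule(get(q, 2, 2), count(2, 2))
    return (r21 << 0) | (r11 << 1) | (r22 << 2) | (r12 << 3)

# A block filling the center is a still life: all four center cells survive.
block = make([(1, 1), (1, 2), (2, 1), (2, 2)])
assert step_quad2(block) == 0xF
# A horizontal blinker flips to vertical: next tick only the western
# center cells (1, 1) and (1, 2) are alive, which is bits 1 and 3.
blinker = make([(0, 1), (1, 1), (2, 1)])
assert step_quad2(blinker) == 0b1010
```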

We’ll get to the details of the lookup table in a later episode; for now, let’s assume that we have a fast lookup for getting the center 1-quad of a 2-quad stepped one ahead. What could we do with this?

Buckle up because this is where we really start to need some diagrams.

The next big insight that drives this algorithm is the question: given the gear we have so far, what could we do if we had a 3×3 square of 2-quads?

Let’s just grab a configuration at random; this is a 12×12 fragment from the 151st generation of acorn.

I’m going to label the nine 2-quads according to their compass directions from the center 2-quad:

I want to focus your attention in particular on the 3-quad that is the upper left group of four 2-quads. What if we moved forward the NW, N, W and center 2-quads by one tick? We’d get four 1-quads.

Here is the same fragment, but this time from the 152nd generation of acorn with those four 1-quads marked:

We are making some progress here. The next insight is: what could we do if we had the same cells but we drew the 2-quad boundaries slightly differently? What if we drew them like this?

The labels here are N-W is the 2-quad that is on the western side of the northern edge, N-E is on the eastern side, and similarly for center-E and center-W.

If we moved those four 2-quads forwards one step, we’d get four more 1-quads:

I’m sure you see how this is going, but just to labour the point, let’s see it through. If we draw the 2-quad boundaries like this:

Then we can move those four 2-quads forwards one tick to get these portions of the next tick:

And finally, if we can divide up the grid like this:

Then we can get the next tick at:
Put it all together and what have we got?

If we have a method that takes nine 2-quads then:

  • We can generate from those nine 2-quads twelve other 2-quads that are portions of the interiors of those nine
  • Of the 21 2-quads we now have in hand, move 16 of them forwards one step
  • That generates these 16 1-quads:

  • Which is enough information to construct a 3-quad in the next generation.

I suspect that you have two questions on your mind right about now:

  • How on Earth are we going to do all those steps — slice up those nine quads into twelve more, step sixteen of them, and then somehow put those sixteen four-bit integers into some data structure that is useful to us — efficiently?
  • Why is getting a particular 8×8 portion of a 12×12 grid stepped forwards one tick useful?

Next time on FAIC: We’ll try to write some code to answer that first question.




I went out to Shilshole Bay Marina Tuesday night to get a few photos of the comet; it is quite spectacular! If you’re going stargazing this week, bring binoculars, look to the northwest about an hour after sunset, below the Big Dipper.

And don’t forget to turn around and look the other way because Jupiter and Saturn are in opposition and very bright, just left of Sagittarius.

Click on photos for a higher resolution version.

Life, part 23

This series is getting much longer than I expected, but I’m having a great time, so let’s keep it going. I want to look at two more algorithms; for the next few episodes we’ll look at the brilliant and subtle QuickLife by Alan Hensel.

This algorithm is a master class in bit twiddling solutions, and rather difficult to understand compared to the algorithms we’ve seen so far. The source code, unfortunately, does not make it easier. A typical fragment from the original 1990s-era Java source code, linked above, is:

int ix01 = (ix00 & 0xf0f0) | (cN.q[14] & 0x0f0f);
int ix11 = (ix10 & 0xf0f0) | (cN.q[15] & 0x0f0f);
int ix03 = (ix02 & 0xf0f0) | (ix01 & 0x0f00) | (cN.q[7] & 0x000f);

What on Earth is this code doing? You can’t read it in isolation and have any clue. What is “q”? What are these masks for? Why are 14, 15 and 7 important? What do the variables represent?

Long-time readers of this blog know that when I write code, I always try to make the code emphasize ideas in the business domain of the program. The Java implementation of QuickLife strongly emphasizes the mechanism domain; it is a program all about twiddling bits. In the code fragment above the only hint of business domain is that cN is a block of cells to the north of something else. The code is also rather hard to understand because there are practices such as “if” statements with 700 lines between the condition and the “else”.

Now, to be fair, the point of the code was not pedagogic clarity, and Java’s lack of properties and user-defined value types makes it more difficult to capture bit twiddling in efficient unboxed types. For the purposes of this blog I do want to be pedagogically clear though, so I’m going to implement the same algorithm in C#, but broken down into small parts that are individually clear.

Also, I’m going to try something a little different this time; I’m not going to describe the overall algorithm or data structures up front. Rather, we’ll work our way up from tiny little boards to big ones, adding features and efficiencies as we go. And we’ll also see why throughout this series I’ve been pushing this notion of Life grids that are “quads” — squares where the side is a power of two.

Let’s start as small as we can. We represent a board with a single cell — a 0-quad — as a single bit. I’m not going to make a data structure for that.

A board with four cells — a 1-quad — is a four-bit unsigned integer. We’ll call the four bits “SE”, “SW”, “NE” and “NW” for obvious reasons. That is, bits 0, 1, 2 and 3 of our four-bit integer are logically:

  3 2
  1 0

I’m also not going to make a data structure for a 1-quad.

A 2-quad is four 1-quads, so that’s 16 bits. We can represent that as a 16 bit unsigned short integer, and this time I am going to make a data structure.

The convention we’ll use is slightly different than the convention we used for 1-quads (and frankly I am not sure why that is; I am just following the conventions of the code as it was originally written!) For a 2-quad, the low four bits are the southeast corner, then the next four are northeast, then southwest, then northwest. That is, the bit numbers in a 2-quad are:

    NW         NE
      f e | 7 6
      d c | 5 4
      b a | 3 2
      9 8 | 1 0
    SW         SE 

I am going to make a data structure for this one! First thing is: I need a way to turn a ushort into one of these, and to turn it back into a ushort:

struct Quad2
  private readonly ushort cells;
  public Quad2(ushort cells) => this.cells = cells;
  public static explicit operator ushort(Quad2 x) => x.cells;

And now I am regretting my decision early on in this series to propose the jargon “n-quad” when I could have proposed “quad-n”. “2Quad” is not a legal identifier in C#. Forgive me if from now on I go back and forth between “n-quad” and “quad-n” for the rest of this series.

A default value of “all dead” and a predicate to check for all-deadness are going to come in handy later, so I’ll just add them now.

  public static readonly Quad2 AllDead = new Quad2(0);
  public bool Dead => cells == 0;

We will need to access the component 1-quads, but rather than returning 4-bit integers, it turns out that in this algorithm we can almost always simply get the 1-quad masked out in place; that is, the NW property of a Quad2 should be a Quad2 that is zero everywhere but the northwest.

This smart insight — that the operations on 1-quads can be done “in place” rather than having to mask-and-bit-shift them out — is one of the keys to the performance of this algorithm, so keep that in mind as we go through the next few episodes.

As we will see in the next episode, we will also need to do the same for the 2×4 and 4×2 edges on each side:

  const uint NWMask = 0xf000;
  const uint SWMask = 0x0f00;
  const uint NEMask = 0x00f0;
  const uint SEMask = 0x000f;
  const uint EastEdgeMask = NEMask | SEMask;
  const uint WestEdgeMask = NWMask | SWMask;
  const uint SouthEdgeMask = SWMask | SEMask;
  const uint NorthEdgeMask = NWMask | NEMask;
  public Quad2 NW => new Quad2((ushort)(cells & NWMask));
  public Quad2 NE => new Quad2((ushort)(cells & NEMask));
  public Quad2 SW => new Quad2((ushort)(cells & SWMask));
  public Quad2 SE => new Quad2((ushort)(cells & SEMask));
  public Quad2 NorthEdge => new Quad2((ushort)(cells & NorthEdgeMask));
  public Quad2 SouthEdge => new Quad2((ushort)(cells & SouthEdgeMask));
  public Quad2 EastEdge => new Quad2((ushort)(cells & EastEdgeMask));
  public Quad2 WestEdge => new Quad2((ushort)(cells & WestEdgeMask));

That does it for the component 1-quads, but we will also need to get at individual cells. We should talk about coordinates. Let’s say that for a 2-quad, the convention is that (0, 0) is the far southwest corner and (3, 3) is the far northeast corner, as usual.

Of course I want structs to be immutable. I’m going to split “set” and “clear” up into two operations, one which sets a cell to alive and one which sets it to dead. I’ll omit the error checking that ensures that the coordinates are (0, 0) through (3, 3):

  private static readonly uint[] masks = new uint[]
    1 << 0x9, 1 << 0x8, 1 << 0x1, 1 << 0x0,
    1 << 0xb, 1 << 0xa, 1 << 0x3, 1 << 0x2,
    1 << 0xd, 1 << 0xc, 1 << 0x5, 1 << 0x4,
    1 << 0xf, 1 << 0xe, 1 << 0x7, 1 << 0x6
  public bool Get(int x, int y) => 
    (cells & masks[x + y * 4]) != 0;
  public Quad2 Set(int x, int y) => 
    new Quad2((ushort)(cells | masks[x + y * 4]));
  public Quad2 Clear(int x, int y) => 
    new Quad2((ushort)(cells & ~masks[x + y * 4]));
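That masks table is fiddly enough to be worth a mechanical check: every entry should be a distinct single bit, and each cell’s bit should land inside the 1-quad that its coordinates say it belongs to. A throwaway Python sketch, with the table and quadrant masks transcribed from the C# above:

```python
masks = [1 << 0x9, 1 << 0x8, 1 << 0x1, 1 << 0x0,
         1 << 0xb, 1 << 0xa, 1 << 0x3, 1 << 0x2,
         1 << 0xd, 1 << 0xc, 1 << 0x5, 1 << 0x4,
         1 << 0xf, 1 << 0xe, 1 << 0x7, 1 << 0x6]

NW_MASK, SW_MASK, NE_MASK, SE_MASK = 0xF000, 0x0F00, 0x00F0, 0x000F

for y in range(4):
    for x in range(4):
        m = masks[x + y * 4]
        # west half is x < 2, south half is y < 2
        if x < 2:
            expected = SW_MASK if y < 2 else NW_MASK
        else:
            expected = SE_MASK if y < 2 else NE_MASK
        assert m & expected == m    # the bit is inside the right 1-quad

# the sixteen masks are distinct single bits covering the whole ushort
assert len(set(masks)) == 16
assert sum(masks) == 0xFFFF
```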

In the next episode I’m also going to need to be able to “or” two Quad2s together.

  public static Quad2 operator |(Quad2 x, Quad2 y) => 
    new Quad2((ushort)(x.cells | y.cells));

And for debugging purposes, we can dump out a Quad2 in the “cells” file format:

  public override string ToString()
    string s = "";
    for (int y = 3; y >= 0; y -= 1)
      for (int x = 0; x < 4; x += 1)
        s += this.Get(x, y) ? 'O' : '.';
      s += "\n";
    return s;

We will need just a few more operations in the next episode, but we’ll motivate them then.

I know, this looks like a lot of code to write for what is a 16 bit unsigned short integer, but remember that fragment of bit twiddling code from the original Java source?

int ix01 = (ix00 & 0xf0f0) | (cN.q[14] & 0x0f0f);
int ix11 = (ix10 & 0xf0f0) | (cN.q[15] & 0x0f0f);
int ix03 = (ix02 & 0xf0f0) | (ix01 & 0x0f00) | (cN.q[7] & 0x000f);

I already like it a lot better as:

Quad2 ix01 = ix00.NorthEdge | cN.q[14].SouthEdge;
Quad2 ix11 = ix10.NorthEdge | cN.q[15].SouthEdge;
Quad2 ix03 = ix02.NorthEdge | ix01.SW | cN.q[7].SE;

In this new form you can look at the code and, though it is still not clear what “ix01” and “q” and “14” mean, at least we now know what is happening: we are constructing three new 2-quads from the component 1-quads of six other 2-quads.

Writing the code this way does not make it slower! Value types are not boxed in C#, and the jitter is smart about inlining operations like this. (And if it isn’t, then it is a lot easier to speed up code if it is written so that its business semantics are clear.)

Next time on FAIC: What is the step-forwards-one-tick algorithm for a 2-quad? It’s only 16 cells; how hard could it be?




Life, part 22

Code for this episode is here.

So far in this series every algorithm we’ve attempted has been either O(cells) or O(changes) in time, and O(cells) in space. Going with a straightforward “big square with dead cells surrounding it” approach tends to use up a lot of space as the side of the square grows larger.

Is there a way to make the space usage asymptotically smaller, say, O(living cells)? Then we could have arbitrarily large boards and know that we are only paying for the cells that are alive. Of course it would certainly be nice if the step time was still O(changes) while we’re at it.

Before I get into that though, a quick note about the UI; I’ve added the ability for the different Life engines to report statistics to the UI such as generation count, number of changed cells on the previous generation, or anything else that might be of interest while watching a pattern evolve. If you’re playing around with the engines, consider using this rather simple new feature to try to gain some insight into their behaviours.

The code is straightforward; if a Life engine implements an interface IReport then IReport.Report is called every time the UI updates; easy peasy.

After writing today’s implementation I came across this charmingly straightforward similar Life implementation in Python, on Madelyn Eriksen’s blog. As Madelyn points out, rather than going with a “dense” array that is over 50% dead cells almost always, we could instead use a “sparse” array; that is, simply make a collection of the coordinates of the living cells, and spend no space at all representing the empty ones.

That’s O(living cells) in memory size, obviously. But what about the time cost? The Python implementation given is also O(living cells) per tick, not O(changes). The implementation of a sparse array algorithm that is O(living cells) in space and O(changes) in time is straightforward so let’s just go for it!

Our basic data structures are: a set of living cells, a set of recently born cells, and a set of recently died cells:

class SparseArray : ILife, IReport
  private HashSet<(long, long)> living;
  private HashSet<(long, long)> recentBirths;
  private HashSet<(long, long)> recentDeaths;
  private int generation;
  public SparseArray() => Clear(); 
  public void Clear() 
    living = new HashSet<(long, long)>(); 
    recentBirths = new HashSet<(long, long)>(); 
    recentDeaths = new HashSet<(long, long)>(); 
    generation = 0; 

Nothing at all surprising about that.

Our getter and setter indexers can, for the first time in this series, skip checking whether the coordinates are valid:

  public bool this[long x, long y]
    get => living.Contains((x, y));
    set
      if (value)
        if (living.Add((x, y)))
          recentBirths.Add((x, y));
      else
        if (living.Remove((x, y)))
          recentDeaths.Add((x, y));

And now our step algorithm is just as you’d expect: for each previously-changed cell, check to see if it or any of its neighbours need to change on this tick. If so, record them into brand-new “recent births” and “recent deaths” sets that will be the change list for the next generation.

Updating the living set with the changes is two built-in set operations.

  public void Step()
    var previousBirths = recentBirths;
    recentBirths = new HashSet<(long, long)>();

    var previousDeaths = recentDeaths;
    recentDeaths = new HashSet<(long, long)>();
    foreach ((long x, long y) in previousBirths) 
      CheckCellAndNeighbours(x, y); 
    foreach ((long x, long y) in previousDeaths) 
      CheckCellAndNeighbours(x, y); 
    living.UnionWith(recentBirths); 
    living.ExceptWith(recentDeaths); 
    generation += 1; 

And of course our “check the cell and its neighbours” function is very similar to code we’ve seen several times already in this series:

  private void CheckCellAndNeighbours(long x, long y)
    for (int iy = -1; iy < 2; iy += 1)
      long cy = y + iy;
      for (int ix = -1; ix < 2; ix += 1)
        long cx = x + ix;
        bool state = this[cx, cy];
        int nw = this[cx - 1, cy - 1] ? 1 : 0;
        int n = this[cx, cy - 1] ? 1 : 0;
        int ne = this[cx + 1, cy - 1] ? 1 : 0;
        int w = this[cx - 1, cy] ? 1 : 0;
        int e = this[cx + 1, cy] ? 1 : 0;
        int sw = this[cx - 1, cy + 1] ? 1 : 0;
        int s = this[cx, cy + 1] ? 1 : 0;
        int se = this[cx + 1, cy + 1] ? 1 : 0;
        int count = nw + n + ne + w + e + sw + s + se;
        if (state & count != 2 & count != 3)
          recentDeaths.Add((cx, cy));
        else if (!state & count == 3)
          recentBirths.Add((cx, cy));
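
Putting the pieces together, the whole algorithm is compact enough to sketch in a few dozen lines of Python (an illustrative translation, not the C# above); here it is running a blinker, which oscillates as expected:

```python
# Illustrative Python translation of the sparse-array algorithm above.
class SparseLife:
    def __init__(self):
        self.living, self.births, self.deaths = set(), set(), set()

    def set_alive(self, x, y):
        if (x, y) not in self.living:
            self.living.add((x, y))
            self.births.add((x, y))

    def step(self):
        changed = self.births | self.deaths
        self.births, self.deaths = set(), set()
        # Check every previously-changed cell and its neighbours.
        for (x, y) in changed:
            for cy in (y - 1, y, y + 1):
                for cx in (x - 1, x, x + 1):
                    state = (cx, cy) in self.living
                    count = sum((cx + i, cy + j) in self.living
                                for i in (-1, 0, 1) for j in (-1, 0, 1)
                                if (i, j) != (0, 0))
                    if state and count != 2 and count != 3:
                        self.deaths.add((cx, cy))
                    elif not state and count == 3:
                        self.births.add((cx, cy))
        # Updating the living set is two built-in set operations:
        self.living |= self.births
        self.living -= self.deaths

life = SparseLife()
for x in (0, 1, 2):          # a horizontal blinker
    life.set_alive(x, 0)
life.step()
assert life.living == {(1, -1), (1, 0), (1, 1)}   # now vertical
life.step()
assert life.living == {(0, 0), (1, 0), (2, 0)}    # and back again
```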

An interesting point about this implementation: I said near the start of this series that I was not going to implement “wrap around” semantics, but I suppose I have here; since I have no bounds checks on the coordinates at all, we’ll “wrap around” if we go over the maximum value of a long. I’m not too worried about it.

That was a very short and sweet implementation compared to the bit-twiddly, repetitive, table-driven code we had to write for Stafford’s algorithm! But I want to consider the memory and time performance of this algorithm.

The memory performance is asymptotically pretty good; plainly the memory used is proportional to the number of living cells, which we have not seen before in this series. However we should also consider that proportion. For Abrash’s algorithm we had one byte per cell, living or dead (though we only used six of the eight bits). For Stafford’s algorithm we similarly used 15 bits of a 16 bit short to store three cells, so that’s five bits per cell. As I mentioned last time, David Stafford emailed me and described some schemes whereby one could compress that even further to store more neighbour counts and cell state in fewer bits.

This scheme uses a massive 128 bits — two longs — per living cell! Now, it uses nothing at all for the dead cells, which is great, but still, 128 is a lot. And that’s not counting all the overhead of maintaining the internal data structures of the hash set, which are also O(living cells) additional burden of unknown size. (I mean, I could work it out by reading the hash set source code, but I’m not going to.)

The up side of course is we get the full two-to-the-128 cells available in our 64-quad, but it seems like we are using a bunch of memory here. There are things we could do about that. We could have a (short, short) dictionary for all the cells inside a 16-quad, for example, which seems pretty reasonable, and now we’re down to 32 bits per living cell instead of 128; we could then have a sparse array of 16-quads, and so on; you see how it goes I’m sure. This could get quite complex, but we could get it down to a relatively small number of bits per living cell if we tried.
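A minimal sketch of that two-level idea — hypothetical, not implemented anywhere in this series — looks something like this: an outer dictionary of 16-quads, each holding a set of 16-bit local offsets, so each living cell costs 32 bits plus per-quad overhead.

```python
# Hypothetical sketch of the two-level sparse scheme described above.
class TwoLevelSparse:
    def __init__(self):
        self.quads = {}   # (quad_x, quad_y) -> set of (local_x, local_y)

    def set_alive(self, x, y):
        # A 16-quad is 65536 cells on a side, so the high bits pick the
        # quad and the low 16 bits are the offset within it.
        key = (x >> 16, y >> 16)
        self.quads.setdefault(key, set()).add((x & 0xFFFF, y & 0xFFFF))

    def is_alive(self, x, y):
        local = self.quads.get((x >> 16, y >> 16))
        return local is not None and (x & 0xFFFF, y & 0xFFFF) in local

grid = TwoLevelSparse()
grid.set_alive(100_000, 3)
assert grid.is_alive(100_000, 3)
assert not grid.is_alive(100_000, 4)
# Cells in the same 16-quad share one dictionary entry:
grid.set_alive(100_001, 3)
assert len(grid.quads) == 1
```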

What’s your intuition about the wall-clock performance of this algorithm on our standard performance problem of “5000 generations of acorn”? Give that some thought and then scroll down for the results to see if you were right.







Algorithm           time(ms)  ticks  size(quad)    megacells/s
Naïve (Optimized):   4000       5K      8               82
Abrash                550       5K      8              596
Abrash w/ changes     190       5K      8             1725 
Abrash, O(c)          240       5K      8             1365
Stafford              180       5K      8             1820
SparseArray          4000       5K     64                ?

We are back to the time performance of our original (slightly optimized) naïve array implementation! Once again we are doing a dozen set operations per step per change, just as we were doing dozens of array operations per cell in the naïve implementation. Those set operations are more expensive than array operations, so the fact that we are doing fewer operations by only looking at changed cells is balanced out by the increased cost per operation.

It’s not clear what “megacells per second” even means now that the board size is essentially unlimited.

We could do this entire series over again; instead of starting with regular rectangular arrays as our basic data structure, we could start with sets and dictionaries. We could have a dictionary mapping (long, long) to (count, bool) and update neighbour counts on changes as we did in Abrash’s algorithm; we could make a sparse array version of Stafford’s algorithm. We would make the memory burden of those algorithms proportional to the number of living cells, but in every case I am quite sure we would see a significant time performance penalty for going with hash sets over normal arrays.

Though it would be interesting to try and then compare the sparse vs dense array implementations, this series is already far longer than I thought it would be, and we still have at least two more implementations to get through; let’s move forward rather than revisiting ground we’ve already covered.

Next time on FAIC: We’ll look at a new idea for a lookup-table based Life algorithm.

For today’s extra content: a few episodes back we were talking about puffers, and I mentioned that the first puffer found leaves behind some blinkers and bookends:

Fun fact: if you add another couple of spaceships to run alongside puffer1, the debris becomes loafs, blocks and gliders!

This is a simplified version of the pattern “backrake 3”. It has a number of interesting uses, which we will get to in upcoming episodes.





Life, part 21

Last time on FAIC I said that we were done with Stafford’s algorithm, but right after that went up I received a thoughtful email from David Stafford himself; he then introduced me to Michael Abrash and Terje Mathisen, who was also a participant in Michael’s optimization challenge. I am happy to know that this series brought back good memories from the 1990s.

We had an email thread over the weekend discussing other approaches that they considered taking for this challenge, some of which were quite ingenious.

As I mentioned a few episodes back, the key insight that makes it possible to get three cells and their neighbor counts into a 16 bit integer is that we don’t need four bits to represent the neighbour count, even though that count goes from 0 to 8. David wondered if there was a way to fit even more information into a small sack of bits and there is!

  • If we have four cells in a short and we arrange the cells in a square — a “1-quad” in the jargon I’m using for this series — then each cell has exactly five “outside neighbours”, instead of the triplet approach where two have seven “outside neighbours” and one has six. Each of the four cells has a living neighbour count from 0 to 5 which takes three bits, so that is twelve bits total for the counts.
  • Can the counts be less than twelve bits? I pointed out a few episodes back that not every triplet is possible to observe in a real board; you can’t have a life grid where a “middle” cell has eight living neighbors but the “left” and “right” cells have no living neighbours, for instance. The same is true of this quad data structure; there are not 4096 configurations of living neighbour counts that actually occur, so we are wasting bits. In fact there are only 472 possible configurations; we can number them and that number fits into nine bits!
  • In the triplet approach we are storing the “next state” and the “current state” in the same short. Suppose the cell is not changing; the next state and the current state will be the same. We are spending O(board size) memory in the form of three bits per triplet to track changes that mostly do not happen. We have an O(changes) data structure already: the change list! Put the “next state” in the change list, not the quad, and use the change list to update the array of quads after the change list is computed.
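The 472 figure in the second bullet is easy to check by brute force. Here’s a small Python sketch — just a check, not part of any implementation — that enumerates all 2^12 configurations of the twelve cells surrounding a 1-quad and collects the distinct 4-tuples of outside-neighbour counts:

```python
from itertools import product

# The twelve cells surrounding a 2x2 block, as (x, y) offsets; the block
# itself occupies (0,0), (1,0), (0,1), (1,1).
ring = [(x, y) for x in range(-1, 3) for y in range(-1, 3)
        if not (0 <= x <= 1 and 0 <= y <= 1)]
assert len(ring) == 12

block = [(0, 0), (1, 0), (0, 1), (1, 1)]

def outside_counts(alive):
    # For each of the four block cells, count its living ring neighbours.
    return tuple(sum((cx + i, cy + j) in alive
                     for i in (-1, 0, 1) for j in (-1, 0, 1)
                     if (i, j) != (0, 0))
                 for (cx, cy) in block)

configs = {outside_counts({c for c, bit in zip(ring, bits) if bit})
           for bits in product((0, 1), repeat=12)}

# Each cell sees at most 5 outside neighbours, and only 472 distinct
# count configurations occur, so a configuration number fits in 9 bits.
assert all(max(c) <= 5 for c in configs)
assert len(configs) == 472
```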

Put it all together and you can easily fit four cells and their neighbour counts into a 16 bit short with room to spare for things like “am I on an edge?” and so on. The table lookups and change list construction would need to be reworked slightly but the mechanisms are basically the same. This algorithm would do four cells in parallel instead of three, and consume less memory because we’re down to 16 bits per four cells, instead of per three.

I’m not going to implement it, but it would totally work and would be quite a bit faster than what we’ve got now.

Terje also came up with a four-bits-per-cell algorithm, but did not use a change list; he described a SIMD approach using a variation on Scholes’ Algorithm as well. The fact that those algorithms are O(board size) and not O(changes) meant they were not winners though.

Michael mentioned that he considered approaches for identifying “bounding boxes” around active regions surrounded by empty space, and fracturing the larger board into smaller boards, each of which can be computed using whatever algorithm is best for that size. However, identifying bounds, growing arrays, splitting and merging active regions, that is all really quite complicated.

There were a bunch more ideas but I think I’ll leave it there for now. Thanks to all involved for a lovely conversation.

Next time on FAIC: thus far every attempt we’ve made has been some variation on “make a big square array of cells”; even if the stepping algorithm is O(changes), the memory burden is still O(cells), which makes it hard to make really big boards; a megabyte here, a megabyte there and soon you are talking about a lot of memory! There’s a completely different approach that is very simple, but what is the time performance cost? And is it really an improvement in memory cost?

For today’s extra content: we know that a spaceship moves itself through empty space. A “fuse” is like a spaceship but it requires a particular kind of non-empty space to move through; it typically destroys or changes the medium that it “burns”. Here’s an example of a “blinker fuse” from the wiki; you give it any number of blinkers in a row as illustrated, and it will destroy all of them eventually.


Life, part 20

In today’s episode I want to again pause for a moment — this time, to verify that our allegedly O(change) implementation of Stafford’s algorithm really does have its performance gated on the number of cells changing in a tick.

Here’s my new performance testing scenario:

I’ve modified my implementation of Stafford’s algorithm to make an 11-quad; that is a 2048×2048 board. I’ve then tiled the bottom edge with 50 glider guns, each of which produces a new glider every 30 ticks.

Gliders move at c/4, so the leftmost glider gun’s first glider will reach the top right corner in about 8200 ticks; the rightmost glider gun’s first glider will reach the right edge almost immediately, and every other glider gun will be somewhere in between.

The result is that the number of changed triplets processed per tick starts off as a linear function which then gradually slopes down to become constant around 8200 ticks; at 8200 we are on average losing as many gliders to the rectangle of death as we are creating:

If our algorithm really is O(changes) then we should expect the number of ticks processed per second to go down while the number of changes is increasing, and to level off when the number of changes levels off; and in fact that is what we see:

Or, put another way, we should expect that changes per millisecond should be roughly constant:

It’s a little noisy but I did not go to the trouble of setting up a noise-free performance testing environment! We can see that the trend here certainly looks like it steadies out at around 7000 triplet changes processed per millisecond.

Like I said before, we could continue to optimize our implementation of Stafford’s algorithm by profiling, finding the slowest part, micro-optimizing it, and so on; reader Mark Pflug reports that you can get a 20% marginal improvement by going to pointers instead of 2-d array accesses, for instance. I want this series to be more about the algorithms than the small details, so I think we are finally done with Stafford’s algorithm.

(UPDATE: We are not quite done with Stafford’s algorithm! Five minutes after I hit publish on this article I received a very kind email from David Stafford; I will discuss some further thoughts briefly in the next episode.)

However, I do want to make a few observations about the algorithms we’ve seen so far and the performance gains we’ve achieved in the context of this test case.

  • O(changes) is clearly better than O(cells), but even so, the number of changing cells can grow linearly, and on a fixed-size board can be a considerable fraction of that board. (And I still haven’t said whether there are patterns where the number of changes per tick grows super-linearly! What’s your intuition there?)
  • Of the four million cells in this board, almost half of them were dead cells never even close to a cell that ever changed or ever would change, but we spent over 600 kilobytes to hold the values of those dead cells.
  • Once again we see that even with a board of four million cells, we run into the rectangle of death in only 8200 ticks, which is not that many in the grand scheme of things.

It seems like we’re going to have difficulty scaling up to larger boards with patterns that we would like to run for a long time to understand their behaviours. Stafford’s algorithm fits four million cells into 2.7 million bytes, and sure, we’ve got machines with plenty of memory these days, but it seems like we could be doing better.

Our insight for optimizing time was “most cells stay the same, so only process the changes”. Can we apply a similar insight to optimizing space? Thus far every algorithm we’ve considered has been O(cells) in memory; could we observe “most cells are dead” and come up with an O(living cells) memory burden? And would that enable us to expand beyond these small, fixed-size “rectangle of death” solutions?

Next time on FAIC: A new, much simpler algorithm that is O(changes) in time and O(living) in space, but at what cost in constant factor?

For today’s extra content: Sometimes you want to change the direction of a glider. Ideally a configuration which changes the direction of an incoming glider should repair itself after the interaction with the glider so as to be ready for the next glider that needs reflecting. The number of ticks it takes to do so is the “repeat time” of the reflector. Even more ideally, the reflector should be a still Life; such reflectors are called “stable reflectors”.

The stable reflector with the smallest known repeat time — 43 ticks — and size is the Snark; it was discovered in 2013 after a long hunt by Mike Playle. Image from the wiki; see that page for an animation of the Snark in action.



Life, part 19

Code for this episode is here.

Today we can finish off our C# implementation of Stafford’s algorithm. Last time we turned the first pass into a table lookup; it might be a bit harder to optimize the second pass. Let’s take a look at the inner loop of the second pass; I’ll take out the comments and the debugging sanity checks:

Triplet t = triplets[x, y];
if (!t.Changed)
  continue;
if (t.LeftNext)
  BecomeAlive(x * 3, y);
else
  BecomeDead(x * 3, y);
if (t.MiddleNext)
  BecomeAlive(x * 3 + 1, y);
else
  BecomeDead(x * 3 + 1, y);
if (t.RightNext)
  BecomeAlive(x * 3 + 2, y);
else
  BecomeDead(x * 3 + 2, y);
changes.Add((x, y));

The key to understanding why this is very suboptimal is to look at what the code actually does in a common case. Let’s suppose the “current” triplet state is “dead-alive-alive” and the “next” triplet state is “alive-alive-dead”. That is, the left cell is going to become alive, the middle cell is going to stay alive, and the right cell is going to die.  What happens? Let’s go through it line by line.

if (!t.Changed)

Do some bit twiddling to discover that the current state and the next state are in fact different.

if (t.LeftNext)

Do some bit twiddling to discover that the next state is “alive”. Call BecomeAlive. What does it do?

if (t.LeftCurrent) return false;

Do some more bit twiddling to discover that yes, a change needs to be made.

triplets[tx - 1, y - 1] = triplets[tx - 1, y - 1].UUP();
triplets[tx, y - 1] = triplets[tx, y - 1].PPU();
triplets[tx - 1, y] = triplets[tx - 1, y].UUP();
triplets[tx, y] = t.SetLeftCurrent(true);
triplets[tx - 1, y + 1] = triplets[tx - 1, y + 1].UUP();
triplets[tx, y + 1] = triplets[tx, y + 1].PPU();

Do some bit twiddling to set the current bit to alive; do five additions to five neighbouring triplets to update their living neighbour counts — in parallel, for the cases where there are two counts to update — and return to the inner loop:

if (t.MiddleNext)

Do some bit twiddling to discover that the middle cell “next” state is alive, and call BecomeAlive:

if (t.MiddleCurrent) return false;

Do more bit twiddling to discover that the middle cell is already alive, so there’s no work to do. Go back to the inner loop:

if (t.RightNext)

Do some bit twiddling to discover that the right cell “next” state is dead; call BecomeDead:

if (!t.RightCurrent) return false;

Do some bit twiddling to discover that it is currently alive and so neighbour counts need to be updated.

triplets[tx, y - 1] = triplets[tx, y - 1].UMM();
triplets[tx + 1, y - 1] = triplets[tx + 1, y - 1].MUU();
triplets[tx, y] = t.SetRightCurrent(false);
triplets[tx + 1, y] = triplets[tx + 1, y].MUU();
triplets[tx, y + 1] = triplets[tx, y + 1].UMM();
triplets[tx + 1, y + 1] = triplets[tx + 1, y + 1].MUU();

Twiddle bits to set the current bit to dead; update the neighbour counts of five neighbouring triplets, two of which we just updated when we did the left cell!

Also, did you notice that when we did the left cell, we added one to the middle neighbour counts of the above and below triplets, and just now we subtracted one from that same count? We are doing work and then undoing it a few nanoseconds later.

The observations to make here are:

  • On the good side, the idempotency check at the top of the algorithm is quite cheap. The cost of having duplicates on the “current changes” list is really low. But everything else is terrible.
  • We are doing way too much bit twiddling still, and it is almost all in the service of control flow. That is, most of those bit twiddles are all in “if” statements that choose a path through the algorithm to selectively run different operations.
  • If the left and right cells both changed then we need to update all eight neighbouring triplets, but we actually do ten updates.
  • Worse, if the left, right and middle cells changed, we need to update eight neighbouring triplets but we actually would do twelve updates!
  • We twiddle bits to set the current state up to three times, but we know what the final current state is going to be; it is the same as the “next” state.

How can we fix these problems?

The thing to notice here is that there are only 27 possible control flows through this algorithm. They are:

  • The idempotency check says that we already processed this change. Continue the loop.
  • The left and middle cells are unchanged; the right cell is going from dead to alive.
  • The left and middle cells are unchanged; the right cell is going from alive to dead.
  • The left and right cells are unchanged; the middle cell is going from dead to alive.
  • … and so on; you don’t need me to list all 27.

Let’s come up with a notation; I’ll say that “A” is “was dead, becomes alive”, “D” is “was alive, becomes dead”, and “U” is “no change”. The 27 possible control flows are therefore notated UUU, UUA, UUD, UAU, UAA, UAD, UDU, UDA, UDD, and so on, up to DDD. There are three positions each with three choices, so that’s three-to-the-three = 27 possibilities.

Since there are only 27 control flows, let’s just make a helper method for each! The control flow I just spelled out would be notated AUD, and so:

private bool AUD(int tx, int y)
  triplets[tx - 1, y - 1] = triplets[tx - 1, y - 1].UUP();
  triplets[tx, y - 1] = triplets[tx, y - 1].PUM();
  triplets[tx + 1, y - 1] = triplets[tx + 1, y - 1].MUU();
  triplets[tx - 1, y] = triplets[tx - 1, y].UUP();
  triplets[tx + 1, y] = triplets[tx + 1, y].MUU();
  triplets[tx - 1, y + 1] = triplets[tx - 1, y + 1].UUP();
  triplets[tx, y + 1] = triplets[tx, y + 1].PUM();
  triplets[tx + 1, y + 1] = triplets[tx + 1, y + 1].MUU();
  return true;

There are eight neighbouring triplets that need updating; we update all of them exactly once. We will see why it returns a bool in a minute.

You might be wondering what happens for something like “DDU” — left dies, middle dies, right unchanged. In that case there are neighbouring cells in other triplets that need to have their counts adjusted by two. We could write:

private bool DDU(int tx, int y)
  triplets[tx - 1, y - 1] = triplets[tx - 1, y - 1].UUM();
  triplets[tx, y - 1] = triplets[tx, y - 1].MMM().MMU();

But we notice here that we could save an addition by, once again, making the compiler fold the constant math:

private bool DDU(int tx, int y)
  triplets[tx - 1, y - 1] = triplets[tx - 1, y - 1].UUM();
  triplets[tx, y - 1] = triplets[tx, y - 1].M2M2M();

where we define a helper on the triplet struct:

public Triplet M2M2M() =>
  new Triplet(-2 * lcountone - 2 * mcountone - rcountone + triplet);

We can implement all 27 methods easily enough and we need 12 new helper methods to do the additions. These methods are all tedious, so I won’t labour the point further except to say that method UUU — no change — is implemented like this:

public bool UUU(int tx, int y) => false;

Meaning “there was no change”. All the other methods return true.

So it is all well and good to say that we should make one method for every workflow; how do we dispatch those methods?

We observe that there are only 64 possible combinations of current state and next state, and each one corresponds to one of these 27 workflows, so let’s just make another lookup; this one is a lookup of functions.

We will treat the current state as the low three bits of a six-bit integer, and the next state as the high three bits; that gives us our 64 possibilities. If, say, the next state is alive-alive-dead (AAD) and the current state is dead-alive-alive (DAA) then the change is “become alive-unchanged-become dead” (AUD):

lookup2 = new Func<int, int, bool>[]
  /* NXT CUR */
  /* DDD DDD */ UUU, /* DDD DDA */ UUD, /* DDD DAD */ UDU, /* DDD DAA */ UDD,
  /* DDD ADD */ DUU, /* DDD ADA */ DUD, /* DDD AAD */ DDU, /* DDD AAA */ DDD,
  /* DDA DDD */ UUA, /* DDA DDA */ UUU, /* DDA DAD */ UDA, /* DDA DAA */ UDU,
  /* DDA ADD */ DUA, /* DDA ADA */ DUU, /* DDA AAD */ DDA, /* DDA AAA */ DDU,
  /* DAD DDD */ UAU, /* DAD DDA */ UAD, /* DAD DAD */ UUU, /* DAD DAA */ UUD,
  /* DAD ADD */ DAU, /* DAD ADA */ DAD, /* DAD AAD */ DUU, /* DAD AAA */ DUD,
  /* DAA DDD */ UAA, /* DAA DDA */ UAU, /* DAA DAD */ UUA, /* DAA DAA */ UUU,
  /* DAA ADD */ DAA, /* DAA ADA */ DAU, /* DAA AAD */ DUA, /* DAA AAA */ DUU,
  /* ADD DDD */ AUU, /* ADD DDA */ AUD, /* ADD DAD */ ADU, /* ADD DAA */ ADD,
  /* ADD ADD */ UUU, /* ADD ADA */ UUD, /* ADD AAD */ UDU, /* ADD AAA */ UDD,
  /* ADA DDD */ AUA, /* ADA DDA */ AUU, /* ADA DAD */ ADA, /* ADA DAA */ ADU,
  /* ADA ADD */ UUA, /* ADA ADA */ UUU, /* ADA AAD */ UDA, /* ADA AAA */ UDU,
  /* AAD DDD */ AAU, /* AAD DDA */ AAD, /* AAD DAD */ AUU, /* AAD DAA */ AUD,
  /* AAD ADD */ UAU, /* AAD ADA */ UAD, /* AAD AAD */ UUU, /* AAD AAA */ UUD,
  /* AAA DDD */ AAA, /* AAA DDA */ AAU, /* AAA DAD */ AUA, /* AAA DAA */ AUU,
  /* AAA ADD */ UAA, /* AAA ADA */ UAU, /* AAA AAD */ UUA, /* AAA AAA */ UUU
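Every entry of this table can be derived mechanically from its key: for each cell position, compare the “next” bit with the “current” bit and emit U, A or D. A little Python sketch (purely illustrative; the C# table above is written out by hand) makes the rule concrete:

```python
# Derive the helper name for a six-bit key: the high three bits are the
# next state, the low three bits the current state, leftmost cell in the
# high bit of each group, with A = 1 and D = 0.
def helper_name(key):
    nxt, cur = key >> 3, key & 7
    name = ""
    for bit in (4, 2, 1):              # left, middle, right cells
        n, c = bool(nxt & bit), bool(cur & bit)
        name += "U" if n == c else ("A" if n else "D")
    return name

# next = AAD (110), current = DAA (011): the left cell becomes alive, the
# middle cell is unchanged, and the right cell becomes dead.
assert helper_name(0b110_011) == "AUD"
assert helper_name(0b000_000) == "UUU"   # nothing changes
assert helper_name(0b000_111) == "DDD"   # all three cells die
assert helper_name(0b111_000) == "AAA"   # all three cells are born
```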

We can extract the key to this lookup table from a triplet with a minimum of bit twiddling:

public int LookupKey2 => triplet >> 9;

And we need one final bit twiddle to set the current state to the next state:

public Triplet NextToCurrent() => 
  new Triplet((triplet & ~currentm) | ((triplet & nextm) >> 3));

Put it all together and our second pass loop body is greatly simplified in its action, though it is now rather harder to see what it is doing:

int key2 = triplets[x, y].LookupKey2;
Func<int, int, bool> helper = lookup2[key2];
bool changed = helper(x, y);
if (changed)
  triplets[x, y] = triplets[x, y].NextToCurrent();
  changes.Add((x, y));

We determine what work needs to be done via table lookup; if the answer is “none”, the helper returns false and we do nothing. Otherwise the helper does two to eight additions, each updating one to three neighbour counts in parallel. Finally, if necessary we update the current state to the next state and make a note of the changed triplet for next time.

We’re down to a single control flow decision, and almost all the bit operations are simple additions, so we will likely have considerably improved performance.

It was a long road to get here. I sure hope it was worth it…

Algorithm           time(ms)  ticks  size(quad)    megacells/s
Naïve (Optimized):   4000       5K      8               82
Abrash                550       5K      8              596
Abrash w/ changes     190       5K      8             1725 
Abrash, O(c)          240       5K      8             1365
Stafford v1           340       5K      8              963
Stafford v2           210       5K      8             1560
Stafford final        180       5K      8             1820

A new record for this series, so… yay?

Also, it is still an O(change) algorithm, and it is significantly more space-efficient at three cells per two bytes, instead of one cell per byte, so that’s good; we can store larger grids in the same amount of memory if we want.

But oh my goodness, that was a lot of work to get around a 25% speedup over the one-byte-per-cell O(change) algorithm.

Reading Abrash’s original article from the 1990s describing Stafford’s algorithm, it is a little difficult to parse out exactly how fast it is, and how much faster that is than Abrash’s original algorithm. If I read it right, the numbers are approximately:

  • 200×200 grid of cells
  • 20MHz-33MHz 486
  • Measure performance of the computation, not the display (as we are also doing)
  • Roughly 20 ticks per second for Abrash’s O(cells) “maintain neighbour counts” algorithm.
  • Roughly 200 ticks per second for Stafford’s O(changes) implementation.

David Stafford attained a 10x speedup by applying these optimizations:

  • Store the next state, current state, and neighbour counts in a triplet short
  • Maintain a deduplicated change list of previously-changed triplets
  • Table lookup to compute next state of possibly-changed triplet
  • Branch to a routine that does no more work than necessary to update neighbour counts

Obviously machines are a lot faster 25 years later. Stafford’s algorithm running on a 20MHz 486 does 200 ticks per second, and the same algorithm written in C# running on a 2GHz i7 does over 25000 ticks per second even with a 50%+ larger grid size. But we should expect a machine with 100x clock speed, better processor caches, and so on, to do on the order of 100x more work per second.

(Aside: though we are doing 65536 cells in a grid instead of their 40000, Abrash and Stafford’s implementations also implemented wrap-around behaviour for the edges, which I have not; that complicates their implementations and I avoided the penalty of writing code to deal with those complications. So this is not a purely apples-to-apples comparison in many ways.)

What is a little surprising here is: my C# implementations show a 3x speedup between the two algorithms that were originally compared, not the 10x speedup that was originally observed.

Figuring out why exactly would require rather more analysis of the original code than I’m willing to do, but I have a few conjectures.

First, we don’t know what the test case was. If it had a small number of changes per tick then certainly an O(changes) algorithm is potentially much better than an O(cells) implementation. Perhaps the test case was biased towards O(changes) algorithms.

Second, I have implemented the spirit of both algorithms but there are a lot of differences. My favourite, by way of example, is that Stafford’s original machine code implementation does not use a lookup array for the second-pass optimization, like I do here. The triplet value itself (with some bits masked out) is the address of the code that is invoked. Let me repeat that in case it was not clear: it’s machine code so you have total authority to decide where in memory the helper functions are, and they are placed in memory such that the triplet itself is a pointer to the helper function. There is no table indirection penalty because there is no table.

These are the sorts of tricks that machine language programmers back in the 1990s lived for, and though I can admire it, that’s just not how we do stuff in high level languages. We could spend a bunch of time tweaking the C# versions of these algorithms to use pointers instead of offsets, or arrays with high water marks instead of lists, and so on. We’d get some wins, but they’d be marginal, and I’m not going to bother tuning this further. (As always, if anyone wants to, please make the attempt and leave a comment!)

The third conjecture is: yes, the rising tide of a 100x processor speed improvement lifted all boats, but it also changed the landscape of performance optimization, because not all code is gated on processor speed. RAM latency, the amount of available RAM, cache sizes, and so on, have also changed in that time, but not at the same rate and not to the same effect. Whether any of those changes brought these two algorithms’ performance closer together over the last 25 years is really hard to say.

Next time on FAIC: I want to do some experiments to look more directly at the scalability of these algorithms; do costs actually grow as O(changes)? What happens if we make some larger grids where lots of stuff is changing?

Coming up after that: Every algorithm we’ve looked at so far uses some sort of two-dimensional array as the backing store, which limits us to grids that are, in the grand scheme of things, pretty small. Can we come up with storage solutions that are less constrained, and at what performance cost?

For today’s extra content: a puffer whose debris is spaceships is called a rake; the first rake discovered back in the 1970s is called spacerake. (Image from the wiki.)

Again we see that you can do interesting things by sticking a bunch of c/2 spaceships together. Here’s a wider-view snapshot that maybe makes it a little clearer how it comes to resemble “raking” the plane:

And again, this is an example of a pattern which causes the number of changes per tick to grow linearly over time.

Life, part 18

Code for this episode is here.

A couple episodes back we did a first cut at implementing Stafford’s algorithm using the triplet data structure to store the current and upcoming states of three cells and their living neighbour counts, all in 15 bits of a 16 bit short. Unsurprisingly, it was quite a bit slower than the same algorithm that did one byte per cell; any win from parallelism was swamped, I conjectured, by all the extra bit twiddling.

Is there a way to eliminate some of that bit twiddling?

Today we’ll look at the inner loop of the first pass, which identifies cells that need to change by setting their next state bits:

Triplet c = triplets[x, y];
Triplet t = c;
int lc = t.LeftCount;
int mc = t.MiddleCount;
int rc = t.RightCount;
t = t.SetLeftNext(lc == 3 | t.LeftCurrent & lc == 2);
t = t.SetMiddleNext(mc == 3 | t.MiddleCurrent & mc == 2);
t = t.SetRightNext(rc == 3 | t.RightCurrent & rc == 2);
if (t.Changed)
{
  currentChanges.Add((x, y));
  triplets[x, y] = t;
}

What essentially are we doing here?

  • Compute the new triplet t from the old triplet c.
  • If t is different than c, cause two side effects: update an array, and update a list.
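Setting the bit twiddling aside for a moment, the state rule buried inside those SetLeftNext, SetMiddleNext and SetRightNext calls is just the standard Life rule. A minimal Python sketch of it in isolation (the function name is mine):

```python
# The Life rule applied per cell: alive next tick iff the cell has exactly
# three living neighbours, or is currently alive with exactly two.
def next_state(current_alive, neighbour_count):
    return neighbour_count == 3 or (current_alive and neighbour_count == 2)

assert next_state(False, 3)      # birth
assert next_state(True, 2)       # survival
assert not next_state(True, 4)   # overcrowding
assert not next_state(False, 2)  # stays dead
```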

There’s nothing we can do about the side effects, but we can observe that the computation has the following characteristics:

  • We compute the left, middle and right actual counts; each does bit twiddling to extract one or two states, and one three-bit integer, and then some addition.
  • We then do bit twiddling to again extract the three states…
  • and then we do a bunch of comparison logic on all of the above
  • The net result is the three “next state” bits, which we again use bit twiddling to get into place in the new triplet.
  • And, cherry on top, we do a bunch of bit twiddling to extract the current and next state bits to ask “was there a change?”
  • Oh, the pain.

We have a function that consumes 12 bits of a 16 bit short and produces a new 16 bit short and a Boolean saying whether it changed. But given those 12 bits, the short and the Boolean produced are always the same. We can precompute the results for every possible input once and look them up as our inner loop.

That is, we can move the vast majority of the bit twiddling backwards in time to before the algorithm runs the first step.
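The general shape of the trick, in a small Python sketch: any pure function of a 12-bit input can be traded for an array of 4096 precomputed answers. (The parity function here is a toy stand-in for the real bit-twiddling computation.)

```python
def slow(bits12):
    # Toy stand-in for the expensive computation; any pure function of
    # the 12 input bits works the same way.
    return bin(bits12).count("1") % 2

# Pay the computation cost once, up front, for all 4096 possible inputs...
table = [slow(key) for key in range(1 << 12)]

# ...and the inner loop becomes a single masked array load.
def fast(bits12):
    return table[bits12 & 0x0FFF]

assert all(fast(k) == slow(k) for k in range(1 << 12))
```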

We will need a lookup key containing the twelve bits we need; fortunately, they are the bottom twelve bits so we can extract them with a minimum of bit twiddling by adding a helper to the triplet struct:

public int LookupKey1 => triplet & 0x0fff;

I’ll create two lookups, one for “will there be a change?” and one for “what is the new triplet?” Since the lookups are keyed on a 12-bit integer, we can simply make them both arrays; there’s no need for anything fancier:

static class TripletLookup
{
  public static Triplet[] lookup;
  public static bool[] changed;
}

and initialize them in a static constructor; we just create every possible triplet based on the bottom 12 bits, and compute what the inner loop of the first pass did in the previous version:

  static TripletLookup()
  {
    lookup = new Triplet[1 << 12];
    changed = new bool[1 << 12];
    for (int left = 0; left < 2; left += 1)
      for (int middle = 0; middle < 2; middle += 1)
        for (int right = 0; right < 2; right += 1)
          for (int lc = 0; lc < 8; lc += 1)
            for (int mc = 0; mc < 7; mc += 1)
              for (int rc = 0; rc < 8; rc += 1)
              {
                Triplet t = new Triplet()
                  .SetLeftCurrent(left == 1)
                  .SetMiddleCurrent(middle == 1)
                  .SetRightCurrent(right == 1)
                  .SetLeftCountRaw(lc)
                  .SetMiddleCountRaw(mc)
                  .SetRightCountRaw(rc)
                  .SetLeftNext((lc + middle == 3) | (left == 1) & (lc + middle == 2))
                  .SetMiddleNext((left + mc + right == 3) | (middle == 1) & (left + mc + right == 2))
                  .SetRightNext((middle + rc == 3) | (right == 1) & (middle + rc == 2));
                lookup[t.LookupKey1] = t;
                changed[t.LookupKey1] = t.Changed;
              }
  }

And that’s that.

A few things to note here. First, honestly, how often are you justified in writing a six-deep nested loop? Not very often, so let’s take the opportunity while we’ve got it!

Second, a great many of these lookup conditions are impossible. In Life you cannot get into a situation where the middle cell of a triplet has seven living neighbours that are not left or right, AND left and right both have zero living neighbours other than the middle. Who cares? These two tables take up 12KB of memory, total. We can waste some. And the cost of doing the unnecessary calculations is only paid once, at startup.
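A quick check of that 12KB figure, assuming a Triplet wraps a two-byte short and a bool array element occupies one byte:

```python
entries = 1 << 12          # one entry per possible 12-bit key
triplet_bytes = 2          # a Triplet wraps a 16-bit short
bool_bytes = 1             # a bool array element occupies one byte
total_bytes = entries * (triplet_bytes + bool_bytes)
print(total_bytes)  # → 12288, i.e. 12KB for both tables together
```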

Moreover, in Stafford’s original implementation he went even further and did not bother to pull out the twelve bits for the key; he just precomputed the entire triplet-to-triplet function and made a table with 65536 entries. Why do any bit twiddling at all?

Anyway, we can now replace the entire inner loop of the first pass with:

int key1 = triplets[tx, y].LookupKey1;
if (TripletLookup.changed[key1])
{
  triplets[tx, y] = TripletLookup.lookup[key1];
  currentChanges.Add((tx, y));
}

So much nicer to read, and much faster. Let’s once again run 5000 generations of acorn on an 8-quad:

Algorithm           time(ms)  ticks  size(quad)    megacells/s
Naïve (Optimized)    4000       5K      8               82
Abrash                550       5K      8              596
Abrash w/ changes     190       5K      8             1725
Abrash, O(c)          240       5K      8             1365
Stafford v1           340       5K      8              963
Stafford v2           210       5K      8             1560

We are back in business! We have a three-cells-per-short O(c) algorithm that beats the one-cell-per-byte O(c) algorithm and is within striking distance of beating the “copy the array every time” version.

Aside: I told a small fib earlier in this episode; did you catch it?

Give that some thought and then scroll down.








…we can move the vast majority of the bit twiddling backwards in time to before the algorithm runs the first step.

I put my fib in boldface for added foolery.

When exactly does the lookup initialization happen? I worked on the compiler team and the language design team and I still always have to double-check Jon’s page on the subject whenever it comes up.

Since we have a static class with a static constructor, the static constructor will run when a member is accessed for the first time; we are not using the “relaxed” semantics. That means: the cost of precomputing the tables is incurred during the first call to Step, not before. This could make a difference to our performance metrics!

It does not, because when I collected those metrics I ensured that Step had already been called at least once before I started the stopwatch. But I had to know to do that.

Incidentally, the purpose of the “relaxed” semantics is to allow the initialization of the static fields to happen when the first method that accesses those fields is jitted. That way the jitter does not have to emit code to ensure that the fields are initialized; it just initializes them and then it knows they are initialized! I’ve always found it a little ironic that the “relaxed” semantic means that the jitter gets more eager about calling the initializers early. More eagerness seems like an odd thing to characterize as “relaxed”.

Next time on FAIC: Optimizing the first pass was pretty straightforward. Optimizing the second pass is maybe not so easy. We will do so, and finish up our exploration of Stafford’s algorithm.

For today’s extra content: last time I mentioned that a common technique is to use eaters to constrain unwanted expansion of some pattern in an oscillator; this has been used for a long time, as evidenced by two oscillators discovered in 1972 by Robert Wainwright: dinner table, and roteightor. (Images from wiki; see links.)