Last time on FAIC I said that we were done with Stafford’s algorithm, but right after that went up I received a thoughtful email from David Stafford himself; he then introduced me to Michael Abrash and Terje Mathisen, who was also a participant in Michael’s optimization challenge. I am happy to know that this series brought back good memories from the 1990s.

We had an email thread over the weekend discussing other approaches that they considered taking for this challenge, some of which were quite ingenious.

As I mentioned a few episodes back, the key insight that makes it possible to get three cells and their neighbor counts into a 16 bit integer is that we don’t need four bits to represent the neighbour count, even though that count goes from 0 to 8. David wondered if there was a way to fit even more information into a small sack of bits and there is!

- If we have four cells in a short and we arrange the cells in a square — a “1-quad” in the jargon I’m using for this series — then each cell has exactly five “outside neighbours”, instead of the triplet approach where two have seven “outside neighbours” and one has six. Each of the four cells has a living neighbour count from 0 to 5 which takes three bits, so that is twelve bits total for the counts.
- Can the counts be less than twelve bits? I pointed out a few episodes back that not every triplet is possible to observe in a real board; you can’t have a life grid where a “middle” cell has eight living neighbors but the “left” and “right” cells have no living neighbours, for instance. The same is true of this quad data structure; there are not 4096 configurations of living neighbour counts that actually occur, so we are wasting bits. In fact
**there are only 472 possible configurations**; we can number them and that number fits into nine bits! - In the triplet approach we are storing the “next state” and the “current state” in the same short. Suppose the cell is not changing; the next state and the current state will be the same. We are spending O(board size) memory in the form of three bits per triple to track changes that mostly do not happen. We have an O(changes) data structure already: the change list! Put the “next state” in the change list, not the quad, and use the change list to update the array of quads after the change list is computed.

Put it all together and you can easily fit four cells and their neighbour counts into a 16 bit short with room to spare for things like “am I on an edge?” and so on. The tables lookups and change list construction would need to be reworked slightly but the mechanisms are basically the same. This algorithm would do four cells in parallel instead of three, and consume less memory because we’re down to 16 bits per four cells, instead of per three.

I’m not going to implement it, but it would totally work and would be quite a bit faster than what we’ve got now.

Terje also came up with a four-bits-per-cell algorithm, but did not use a change list; he described a SIMD approach using a variation on Scholes’ Algorithm as well. The fact that those algorithms are O(board size) and not O(changes) meant they were not winners though.

Michael mentioned that he considered approaches for identifying “bounding boxes” around active regions surrounded by empty space, and fracturing the larger board into smaller boards, each of which can be computed using whatever algorithm is best for that size. However, identifying bounds, growing arrays, splitting and merging active regions, that is all really quite complicated.

There were a bunch more ideas but I think I’ll leave it there for now. Thanks to all involved for a lovely conversation.

**Next time on FAIC:** thus far every attempt we’ve made has been some variation on “make a big square array of cells”; even if the stepping algorithm is O(changes), the memory burden is still O(cells), which makes it hard to make really big boards; a megabyte here, a megabyte there and soon you are talking about a lot of memory! There’s a completely different approach that is very simple, but what is the time performance cost? And is it *really* an improvement in memory cost?

**For today’s extra content**: we know that a spaceship moves itself through empty space. A “fuse” is like a spaceship but it requires a particular kind of non-empty space to move through; it typically destroys or changes the medium that it “burns”. Here’s an example of a “blinker fuse” from the wiki; you give it any number of blinkers in a row as illustrated, and it will destroy all of them eventually.