Code for this episode is here.
A couple episodes back we did a first cut at implementing Stafford’s algorithm using the triplet data structure to store the current and upcoming states of three cells and their living neighbour counts, all in 15 bits of a 16-bit short. Unsurprisingly, it was quite a bit slower than the same algorithm that did one byte per cell; any win from parallelism was swamped by, I conjectured, all the extra bit twiddling.
Is there a way to eliminate some of that bit twiddling?
Today we’ll look at the inner loop of the first pass, which identifies cells that need to change by setting their next state bits:
```csharp
Triplet c = triplets[x, y];
Triplet t = c;
int lc = t.LeftCount;
int mc = t.MiddleCount;
int rc = t.RightCount;
t = t.SetLeftNext(lc == 3 | t.LeftCurrent & lc == 2);
t = t.SetMiddleNext(mc == 3 | t.MiddleCurrent & mc == 2);
t = t.SetRightNext(rc == 3 | t.RightCurrent & rc == 2);
if (t.Changed)
{
  currentChanges.Add((x, y));
  triplets[x, y] = t;
}
```
What essentially are we doing here?
- Compute the new triplet t from the old triplet c.
- If t is different than c, cause two side effects: update an array, and update a list.
There’s nothing we can do about the side effects, but we can observe that the computation has the following characteristics:
- We compute the left, middle and right actual counts; each computation does some bit twiddling to extract one or two states and a three-bit integer, and then does some addition.
- We then do bit twiddling to again extract the three states…
- and then we do a bunch of comparison logic on all of the above
- The net result is the three “next state” bits, which we again use bit twiddling to get into place in the new triplet.
- And, cherry on top, we do a bunch of bit twiddling to extract the current and next state bits to ask “was there a change?”
- Oh, the pain.
We have a function that consumes twelve bits of a 16-bit short and produces a new 16-bit short and a Boolean saying whether it changed. But given those twelve bits, the short and the Boolean produced are always the same. We can precompute the results for every possible input once and look them up as our inner loop.
That is, we can move the vast majority of the bit twiddling backwards in time to **before the algorithm runs the first step**.
We will need a lookup key containing the twelve bits we need; fortunately, they are the bottom twelve bits so we can extract them with a minimum of bit twiddling by adding a helper to the triplet struct:
```csharp
public int LookupKey1 => triplet & 0x0fff;
```
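To see why twelve bits are enough, here is a quick accounting of the inputs the rule actually needs; this breakdown is mine, assuming the triplet layout from the earlier episode (three current-state bits and three 3-bit neighbour counts in the low bits, with the three next-state bits above them):

```csharp
// My own accounting of the key bits, assuming the layout described above; not code from the series.
int currentStateBits = 3;            // current state of left, middle, right
int neighbourCountBits = 3 + 3 + 3;  // left count (0..7), middle count (0..6), right count (0..7)
int keyBits = currentStateBits + neighbourCountBits;  // 12, hence the 0x0fff mask
// The three "next state" bits are outputs of the computation, so they are not part of the key.
```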
I’ll create two lookups, one for “will there be a change?” and one for “what is the new triplet?” Since the lookups are keyed on a 12-bit integer, we can simply make them both arrays; there’s no need for anything fancier:
```csharp
static class TripletLookup
{
  public static Triplet[] lookup;
  public static bool[] changed;
```
and initialize them in a static constructor; we just create every possible triplet based on the bottom 12 bits, and compute what the inner loop of the first pass did in the previous version:
```csharp
static TripletLookup()
{
  lookup = new Triplet[1 << 12];
  changed = new bool[1 << 12];

  for (int left = 0; left < 2; left += 1)
    for (int middle = 0; middle < 2; middle += 1)
      for (int right = 0; right < 2; right += 1)
        for (int lc = 0; lc < 8; lc += 1)
          for (int mc = 0; mc < 7; mc += 1)
            for (int rc = 0; rc < 8; rc += 1)
            {
              Triplet t = new Triplet()
                .SetLeftCurrent(left == 1)
                .SetMiddleCurrent(middle == 1)
                .SetRightCurrent(right == 1)
                .SetLeftCountRaw(lc)
                .SetMiddleCountRaw(mc)
                .SetRightCountRaw(rc)
                .SetLeftNext(
                  (lc + middle == 3) | (left == 1) & (lc + middle == 2))
                .SetMiddleNext(
                  (left + mc + right == 3) | (middle == 1) & (left + mc + right == 2))
                .SetRightNext(
                  (middle + rc == 3) | (right == 1) & (middle + rc == 2));
              lookup[t.LookupKey1] = t;
              changed[t.LookupKey1] = t.Changed;
            }
}
```
And that’s that.
A few things to note here. First, honestly, how often are you justified in writing a six-deep nested loop? Not very often, so let’s take the opportunity while we’ve got it!
Second, a great many of these lookup conditions are impossible. In Life you cannot get into a situation where the middle cell of a triplet has seven living neighbours that are not left or right, AND left and right both have zero living neighbours other than the middle. Who cares? These two tables take up 12KB of memory, total. We can waste some. And the cost of doing the unnecessary calculations is only paid once, at startup.
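If you want to double-check that arithmetic, here is the back-of-the-envelope version, assuming a Triplet is backed by a single two-byte short and a bool takes one byte:

```csharp
int entries = 1 << 12;                       // 4096 possible twelve-bit keys
int lookupBytes = entries * sizeof(short);   // 8192 bytes for the Triplet table
int changedBytes = entries * sizeof(bool);   // 4096 bytes for the changed table
int totalBytes = lookupBytes + changedBytes; // 12288 bytes -- about 12KB
```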
Moreover, in Stafford’s original implementation he went even further and did not bother to pull out the twelve bits for the key; he just precomputed the entire triplet-to-triplet function and made a table with 65536 entries. Why do any bit twiddling at all?
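For concreteness, a sketch of that zero-twiddling variant might look something like the following. This is my own hedged illustration, not Stafford’s actual code, and the ShortValue accessor it assumes is a hypothetical helper exposing the triplet’s raw 16-bit value, which the series’ struct does not have:

```csharp
// Hypothetical sketch: key the tables on the full 16-bit value, so no masking is needed.
// (ShortValue is an assumed helper exposing the raw short; it is not part of the series' struct.)
static Triplet[] fullLookup = new Triplet[1 << 16];
static bool[] fullChanged = new bool[1 << 16];

// ... populate both arrays for all 65536 possible shorts, just as the twelve-bit
// tables were populated above, but indexed by the raw value instead of a masked key ...

// The inner loop then needs no bit twiddling at all:
// int raw = triplets[tx, y].ShortValue;
// if (fullChanged[raw]) { triplets[tx, y] = fullLookup[raw]; currentChanges.Add((tx, y)); }
```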
Anyway, we can now replace the entire inner loop of the first pass with:
```csharp
int key1 = triplets[tx, y].LookupKey1;
if (TripletLookup.changed[key1])
{
  triplets[tx, y] = TripletLookup.lookup[key1];
  currentChanges.Add((tx, y));
}
```
So much nicer to read, and much faster. Let’s once again run 5000 generations of acorn on an 8-quad:
| Algorithm | time(ms) | ticks | size(quad) | megacells/s |
|---|---|---|---|---|
| Naïve (Optimized) | 4000 | 5K | 8 | 82 |
| Abrash | 550 | 5K | 8 | 596 |
| Abrash w/ changes | 190 | 5K | 8 | 1725 |
| Abrash, O(c) | 240 | 5K | 8 | 1365 |
| Stafford v1 | 340 | 5K | 8 | 963 |
| Stafford v2 | 210 | 5K | 8 | 1560 |
We are back in business! We have a three-cells-per-short O(c) algorithm that beats the one-cell-per-byte O(c) algorithm and is within striking distance of beating the “copy the array every time” version.
Aside: I told a small fib earlier in this episode; did you catch it?
Give that some thought and then scroll down.
.
.
.
.
.
.
.
…we can move the vast majority of the bit twiddling backwards in time to **before the algorithm runs the first step**.
I put my fib in boldface for added foolery.
When exactly does the lookup initialization happen? I worked on the compiler team and the language design team, and I still always have to double-check Jon Skeet’s page on the subject whenever it comes up.
Since we have a static class with a static constructor, the static constructor will run when a member is accessed for the first time; we are not using the “relaxed” semantics. That means: the cost of precomputing the tables is incurred during the first call to Step, not before. This could make a difference to our performance metrics!
It does not, because when I collected those metrics I ensured that Step had already been called at least once before I started the stopwatch. But I had to know to do that.
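For concreteness, the shape of that warm-up is roughly the following; the names here (sim and its Step method) are stand-ins of my own for whatever the benchmark harness actually calls, not code from the series:

```csharp
// Hypothetical benchmark sketch; the important part is the single warm-up call.
sim.Step();  // first call runs TripletLookup's static constructor and jits the hot path

var stopwatch = System.Diagnostics.Stopwatch.StartNew();
for (int i = 0; i < 5000; i += 1)
  sim.Step();  // the table-building cost is now outside the timed region
stopwatch.Stop();
Console.WriteLine(stopwatch.ElapsedMilliseconds);
```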
Incidentally, the purpose of the “relaxed” semantics is to allow the initialization of the static fields to happen when the first method that accesses those fields is jitted. That way the jitter does not have to emit code to ensure that the fields are initialized; it just initializes them and then it knows they are initialized! I’ve always found it a little ironic that the “relaxed” semantic means that the jitter gets more eager about calling the initializers early. More eagerness seems like an odd thing to characterize as “relaxed”.
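A minimal sketch of the distinction, assuming the standard C# rules: a type with an explicit static constructor is not marked beforefieldinit, so its initializer runs exactly on first use; a type with only field initializers is marked beforefieldinit, so the runtime is free to run the initializer earlier, for instance when a method that touches the field is jitted. The class and member names here are illustrative, not from the series:

```csharp
using System;

// Precise semantics: the explicit static constructor means initialization happens
// exactly at the first access to a member of Precise, and no earlier.
static class Precise
{
  public static readonly int[] Table;
  static Precise() { Console.WriteLine("Precise initialized"); Table = new int[1 << 12]; }
}

// "Relaxed" (beforefieldinit) semantics: with only a field initializer, the runtime may
// initialize at any point before the first field access -- possibly much earlier, such as
// when a method that reads Relaxed.Table is jitted -- so no "initialized yet?" check is needed.
static class Relaxed
{
  public static readonly int[] Table = Build();
  static int[] Build() { Console.WriteLine("Relaxed initialized"); return new int[1 << 12]; }
}
```

Either way the fields are initialized before they are read; the only observable difference is when the “initialized” message can appear relative to the first access.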
Next time on FAIC: Optimizing the first pass was pretty straightforward. Optimizing the second pass is maybe not so easy. We will do so, and finish up our exploration of Stafford’s algorithm.
For today’s extra content: last time I mentioned that a common technique is to use eaters to constrain unwanted expansion of some pattern in an oscillator; this has been used for a long time, as evidenced by two oscillators discovered in 1972 by Robert Wainwright: dinner table, and roteightor. (Images from wiki; see links.)
“We will need a lookup key comprised of the twelve bits we need” — please, no. I wouldn’t normally comment on this, but you’ve harped on “begs the question” in the past, so I presume you’d appreciate a little grammar reminder: the whole comprises the parts, while the parts compose the whole. The whole is NOT “comprised of” the parts, as you’ve written here.
(And yes, I’m aware that through common usage, these meanings have become blurred. But IMHO it’s still worth getting them right according to their actual meanings. We don’t need two different words that mean the same thing except that one sounds “fancier”…it’s much more useful to have two different words with two specifically different meanings.)
I had no idea; none of my copy editors over the years have ever pointed that out. Thanks for the note.
Happy to help. 🙂
I assume the blog software allows you to delete comments…please feel free to delete this thread if you like, as it’s served its purpose and does not seem to have much in the way of value for posterity. 🙂
“how often are you justified in writing a six-deep nested loop? Not very often, so let’s take the opportunity while we’ve got it!” — please forgive my amusement that, after all your fine and educational contributions on Stack Overflow in the area of Cartesian products, here the implementation would go old-school with the nested loops. 🙂
As to “I’ve always found it a little ironic that the “relaxed” semantic means that the jitter gets more eager about calling the initializers early. More eagerness seems like an odd thing to characterize as “relaxed”.”
I know you know this, but of course the relaxation in question is the relaxation of the jitter’s requirement to generate its internal “bool AreStaticFieldsInitializedYet()” checks. Just like property getters for simple properties have more relaxed coding requirements than do getters for lazy-loaded properties, what with null checks, locks, etc. being needed. The jitter gets to relax; we don’t.
All of this is akin to what Raymond Chen calls “kernel-colored glasses”, the tendency for OS writers to view the world from the OS looking out, whereas the vastly more numerous users and customer developers view the same boundary from the outside world looking in to the OS. Which difference in perspective can lead to difficult APIs, confusing documentation, etc.
Writ larger, the takeaway is that most nouns embody a viewpoint, whether consciously or not. When looked at from a different viewpoint, many nouns seem inapt. As applied to coding, nouns include most namespace, class, interface, field, and property names. Good software is written (and named) from a single coherent and consciously selected point of view. Less-good software … well … we’ve all had the “privilege” of dealing with that stuff.
Let me add my thanks for your fascinating and skillful tour through Conway’s Life. Like others commenting in other posts I started devving when Life was new. In a sense I grew up with it. But I’d never delved into it beyond the surface. Reading your work is always an adventure and reading your code is always fabulous. Doubly so about a topic as adventurously fabulous as Life is. And as Conway was. Taken from us too soon; may he rest well.
That’s an insightful comment; indeed the classic joke is correct, naming things is one of the hard problems in computer programming. Even in “programming in the small” like this blog series I see ways in which I failed to pick your “single coherent point of view”. Why did I call the unit of time a “tick” but the method that executes one tick “Step”? No particularly good reason! And this was in a context where I was deliberately trying to make the code clear and the jargon precise.
I’m glad you’re enjoying the series, and thanks for your kind words.
The JIT uses beforefieldinit to initialize more eagerly, but for AOT it can make sense to defer initialization until a field is actually used.