All right, we have our quad data structure, we know how to get and set individual elements, and we know how to display it. We’ve deduplicated it using memoization. How do we step it forward one tick? (Code for this episode is here.)

Remember a few episodes ago when we were discussing QuickLife and noted that if you have a 2-quad in hand, like these green ones, you can get the state of the blue 1-quad one tick ahead? And in fact we effectively memoized that solution by simply **precomputing all 65536 cases.**

The QuickLife algorithm memoized only the 2-quad-to-center-1-quad step algorithm; we’re going to do the same thing but with even more memoization. We have a recursively defined quad data structure, so it makes sense that the step algorithm will be recursive. We will use 2-quad-to-1-quad as our base case.

For the last time in this series, let’s write the Life rule:

private static Quad Rule(Quad q, int count) { if (count == 2) return q; if (count == 3) return Alive; return Dead; }

We’ll get all sixteen cells in the 2-quad as numbers:

private static Quad StepBaseCase(Quad q) { Debug.Assert(q.Level == 2); int b00 = (q.NW.NW == Dead) ? 0 : 1; ... 15 more omitted ...

and count the neighbours of the center 1-quad:

int n11 = b00 + b01 + b02 + b10 + b12 + b20 + b21 + b22; int n12 = b01 + b02 + b03 + b11 + b13 + b21 + b22 + b23; int n21 = b11 + b12 + b13 + b21 + b23 + b31 + b32 + b33; int n22 = b10 + b11 + b12 + b20 + b22 + b30 + b31 + b32; return Make( Rule(q.NW.SE, n11), Rule(q.NE.SW, n12), Rule(q.SE.NW, n21), Rule(q.SW.NE, n22)); }

We’ve seen this half a dozen times before. The interesting bit comes in the recursive step. The key insight is:** for any n>=2, if you have an n-quad in hand, you can compute the (n-1) quad in the center, one tick ahead.**

How? We’re going to use *almost* the same technique that we used in QuickLife. Remember in QuickLife the key was observing that if we had nine Quad2s in the old generation, we could compute a Quad3 in the new generation with sixteen steps on component Quad2s. The trick here is almost the same. Let’s draw some diagrams.

Suppose we have the 3-quad from the image above. We compute the next generation of its four component 2-quads; the green quads are current, the blue are stepped one ahead.

We can use a similar trick as we used with QuickLife to get the north, south, east, west and center 2-quads of this 3-quad, and move each of them ahead one step to get five more 1-quads. I’ll draw the original 3-quad in light green, and we can extract component 2-quads from it that I’ll draw in dark green. We then move each of those one step ahead to get the blue 1-quads.

That gives us this information:

We then make four 2-quads from those nine: and **extract** the center 1-quad from each using the Center function (source code below). I’ll just show the northwest corner; you’ll see how this goes. We make the light blue 2-quad out of four of the blue 1-quads, and then the center 1-quad of that thing is:

We do that four times and from those 1-quads we construct the center 2-quad moved one step ahead:

Summing up the story so far:

- We can take a 2-quad forward one tick to make a 1-quad with our base case.
- We’ve just seen here that we can use that fact to take a 3-quad forward one tick to make a 2-quad stepped forward one tick.
- But nothing we did in the previous set of steps depended on having a 3-quad specifically. Assume that for some n >= 2 we can move an n-quad forward one tick to make an (n-1) quad; we have above an algorithm where we use that assumption and can move an (n+1)-quad forward to get an n-quad.

That is, we can move a 2-quad forward with our base case; moving a 3-quad forward requires the ability to move a 2-quad forward. Moving a 4-quad forward requires the ability to move a 3-quad forward, and so on.

As I’ve said many times on this blog, every recursive algorithm is basically the same. If we’re in the base case, solve the problem directly. If we’re not in the base case, break up the problem into finitely many smaller problems, solve each, and use the solutions to solve the larger problem.

**Let’s write the code to move any n-quad for n >= 2 forward one tick.**

We’ll need some helper methods that extract the five needed sub-quads, but those are easily added to Quad. (Of course these helpers are only valid when called on a 2-quad or larger.)

public Quad Center => Make(NW.SE, NE.SW, SE.NW, SW.NE); public Quad N => Make(NW.NE, NE.NW, NE.SW, NW.SE); public Quad E => Make(NE.SW, NE.SE, SE.NE, SE.NW); public Quad S => Make(SW.NE, SE.NW, SE.SW, SW.SE); public Quad W => Make(NW.SW, NW.SE, SW.NE, SW.NW);

And then I’ll make a static method that takes a Quad and returns the center stepped forward one tick. (Why not an instance method on Quad? We will see in a moment.)

private static Quad Step(Quad q) { Debug.Assert(q.Level >= 2); Quad r; if (q.IsEmpty) r = Empty(q.Level - 1); else if (q.Level == 2) r = StepBaseCase(q); else { Quad q9nw = Step(q.NW); Quad q9n = Step(q.N); Quad q9ne = Step(q.NE); Quad q9w = Step(q.W); Quad q9c = Step(q.Center); Quad q9e = Step(q.E); Quad q9sw = Step(q.SW); Quad q9s = Step(q.S); Quad q9se = Step(q.SE); Quad q4nw = Make(q9nw, q9n, q9c, q9w); Quad q4ne = Make(q9n, q9ne, q9e, q9c); Quad q4se = Make(q9c, q9e, q9se, q9s); Quad q4sw = Make(q9w, q9c, q9s, q9sw); Quad rnw = q4nw.Center; Quad rne = q4ne.Center; Quad rse = q4se.Center; Quad rsw = q4sw.Center; r = Make(rnw, rne, rse, rsw); } Debug.Assert(q.Level == r.Level + 1); return r; }

Well that was easy! We just do **nine recursions** and then reorganize the resulting nine one-tick-forward quads until we get the information we want, and return it. (I added a little easy out for the empty case, though strictly speaking that is not necessary.)

There are probably three things on your mind right now.

- If we get a full quad-size smaller every time we step, we’re going to get down to a very small board very quickly!
- QuickLife memoized the step-the-center-of-a-2-quad operation. Why aren’t we doing the same thing here?
- Nine recursions is a lot; isn’t this going to blow up performance? Suppose we have an 8-quad; we do nine recursions on 7-quads, but
*each*of those does nine recursions on 6-quads, and so on down to 3-quads. It looks like we are doing 9^{n-2}calls to the base case when stepping an n-quad forward one tick.

First things first.

When do we not care if we’re shrinking an n-quad down to an (n-1)-quad on step? When *all* living cells in the n-quad are *already* in the center (n-1)-quad. But that condition is easy to achieve! Let’s actually write our public step method, not just the helper that steps a quad. And heck, let’s make sure that we have more than enough empty space. Remember,* empty space is super cheap. *

sealed class Gosper : ILife, IDrawScale, IReport { private Quad cells; private long generation; ... public void Step() { Quad current = cells;

Make **cells** bigger until there are two “levels” of empty cells surrounding the center. (We showed Embiggen last time.) That way we are definitely not throwing away any living cells when we get a **next** state that is one size smaller:

if (!current.HasAllEmptyEdges) current = current.Embiggen().Embiggen(); else if (!current.Center.HasAllEmptyEdges) current = current.Embiggen(); Quad next = Step(current);

We’ve stepped, so **next** is one size smaller than **current**. Might as well make it bigger too; that’s one fewer thing to deal with next time. Again, empty space is cheap.

cells = next.Embiggen(); generation += 1; }

HasAllEmptyEdges is an easy helper method of Quad:

public bool HasAllEmptyEdges => NW.NW.IsEmpty && NW.NE.IsEmpty && NE.NW.IsEmpty && NE.NE.IsEmpty && NE.SE.IsEmpty && SE.NE.IsEmpty && SE.SE.IsEmpty && SE.SW.IsEmpty && SW.SE.IsEmpty && SW.SW.IsEmpty && SW.NW.IsEmpty && NW.SW.IsEmpty;

That was an easy problem to knock down. Second problem: QuickLife memoized the 2-quad-to-1-quad step algorithm and got a big win; shouldn’t we do the same thing?

Sure, we have a memoizer, we can do so easily. But… what about our third problem? We have a recursive step that is creating exponentially more work as the quad gets larger as it traverses our deduplicated tree structure.

Hmm.

It is recursing on a *deduplicated* structure, which means i*t is probably going to encounter the same problems several times*. If we move a 3-quad forward one step to get a 2-quad, we’re going to get the same answer the second time we do the same operation on the same 3-quad. If we move a 4-quad forward one step to get a 3-quad, we will get the same answer the second time we do it. And so on. **Let’s just memoize everything**.

We’ll rename Step to UnmemoizedStep, create a *third* memoizer, and replace Step with:

private static Quad Step(Quad q) => CacheManager.StepMemoizer.MemoizedFunc(q);

And now we have solved our second and third problems with one stroke.

Let’s run it! We’ll do our standard performance test of 5000 generations of acorn:

Algorithm time(ms) size Mcells/s Naïve (Optimized): 4000 8 82 Abrash (Original) 550 8 596 Stafford 180 8 1820 QuickLife 65 20 ? Gosper v1 3700 60 ?

Oof.

It’s slow! Not as slow as the original naïve implementation, but just about.

Hmm.

That’s the time performance; what’s the memory performance? There’s a saying I’ve used many times; I first heard it from Raymond Chen but I don’t know if he coined it or was quoting someone else. *“A cache without an expiration policy is called a memory leak”.* Memory leaks can cause speed problems as well as memory problems because they increase burden on the garbage collector, which can slow down the whole system. Also, dictionaries are in theory O(1) access — that is, access time is the same no matter how big the cache gets — but theory and practice are often different as the dictionaries get very large.

How much memory are we using in this thing? The “empty” memoizer never has more than 61 entries, so we can ignore it. I did some instrumentation of the “make” and “step” caches; after 5000 generations of acorn:

- the step and make caches both have
*millions*of entries **half**the entries were**never read at all, only written****97%**of the entries were read**fewer than twenty times**- the
**top twenty**most-read entries account for**40%**of the total reads

This validates our initial assumption that there is a huge amount of regularity; the “unusual” situations recur a couple dozen times tops, and we spend most of our time looking at the same configurations over and over again.

This all suggests that we could benefit from an expiration policy. There are two things to consider: what to throw away, and when to throw it away. First things first:

- An LRU cache seems plausible; that is, when you think it is time to take stuff out of the cache, take out some fraction of the stuff that has been Least Recently Used. However LRU caches involve making extra data structures to keep track of when something has been used; we do extra work on each step, and it seems like that might have a performance impact given how often these caches are hit.
- The easiest policy is: throw it all away! Those 20 entries that make up 40% of the hits will be very quickly added back to the cache.

Let’s try the latter because it’s simple. Now, we cannot just throw it *all* away because we must maintain the invariant that Make agrees with Empty; that is, when we call Make with four empty n-quads and when we call Empty(n+1) we must get the same object out. But if we throw away the “make” cache and then re-seed it with the contents of the “empty” cache — which is only 61 entries, that’s easy — then we maintain that invariant.

When to throw it away? What we definitely do not want to happen is end up in a situation where we are throwing away stuff too often. We can make a very simple dynamically-tuned cache with this policy:

- Set an initial cache size bound. 100K entries, 1M entries, whatever.
- Every thousand generations, check to see if we’ve exceeded the cache size bound. If not, we’re done.
- We’ve exceeded the bound. Throw away the caches, do a single step, and re-examine the cache size; this tells us the cache burden of doing one tick.
- The new cache size bound is the larger of the current bound and twice the one-tick burden. That way if necessary the size bound gradually gets larger so we do less frequent cache resetting.

The code is straightforward; at the start of Step:

bool resetMaxCache = false; if ((generation & 0x3ff) == 0) { int cacheSize = CacheManager.MakeQuadMemoizer.Count + CacheManager.StepMemoizer.Count; if (cacheSize > maxCache) { resetMaxCache = true; ResetCaches(); } }

“ResetCaches” throws away the step cache and resets the make cache to agree with the empty cache; I won’t bother to show it. At the end of Step:

if (resetMaxCache) { int cacheSize = CacheManager.MakeQuadMemoizer.Count + CacheManager.StepMemoizer.Count; maxCache = Max(maxCache, cacheSize * 2); }

All right, let’s run it again!

Algorithm time(ms) size Mcells/s Naïve (Optimized): 4000 8 82 Abrash (Original) 550 8 596 Stafford 180 8 1820 QuickLife 65 20 ? Gosper v2 4100 60 ?

It’s worse. Heck, it is worse than the naive algorithm!

Sure, the top twenty cache entries account for 40% of the hits, and sure, 97% of the entries are hit fewer than twenty times. But the statistic that is relevant here that I omitted is: the top* many hundreds* of cache entries account for 50% of the hits. We don’t have to rebuild *just* the top twenty items in the cache to start getting a win from caching again. We take a small but relevant penalty every time we rebuild the caches.

Sigh.

We could keep on trying to improve the marginal performance by improving our mechanisms. We could try an LRU cache, or optimize the caches for reading those top twenty entries, or whatever. We might eke out some wins. But maybe instead **we should take a step back and ask if there’s an algorithmic optimization that we missed.**

**Next time on FAIC:** There is an algorithmic optimization that we missed. Can you spot it?

My first thought is that instread of clearing the cache, rename the Cache as OldCashe and crease a new Cache with the items from empty, Then when Cache is missed, try getting the item from OldCache and add it to Cache. When the Cache gets too large next time, OldCache is left for the GC to clean up.

In the past I have also got good result with a “ramdom with 2nd chance” emission policy.

Sure; that’s a simplified form of an LRU cache. I actually made a cache a little bit like that for Roslyn back in the early days, because we identified that a common scenario was VS doing a lot of lexical and grammatical analysis of source code currently on the user’s screen, and of the currently opened document, and that moreover, a common scenario was a user switching back and forth between two documents. The cache we built was basically “current / recent / everything else”. Not sure if it is still in use in Roslyn; this was a decade ago now.

Anyway, there are certainly many such attempts we could make to improve the cache performance, but what I am getting at here is that the more fundamental problem is that there is a missed trick in the algorithm itself, not in the details of the cache topology. See if you can spot it!

We are correctly calculating (for a 2-quad) 6×6=36 cells for the next step, but use only 4×4=16 of those.

Instead of stepping 9 subquads, we could do with just 4.

I’m not sure that would change orders of magnitude of performance, though.

When you start with a 3-quad you do a bunch of manipulations to create 9 2-quats to calculate 9 1-quads and then do a bunch of manipulations to create a 4 1-quats from those 9, just to create a single 2-quad.

Quicker would be to ‘simply’ calculate the 4 1-quads you need to construct the 2-quad. It’s nice we could ‘reuse’ the QuickLife algorithm, but we’re not in Kansas anymore.

This would cut the amount of calculations we have to do in half. And we have to do fewer manipulations as well.

So a 2x improvement… per *level*. For once a 2^n in our favor!

The previous two comments both give the same idea. We are given an n-quad and want the center (n-1)-quad stepped. Instead of constructing nine (n-1)-quads, getting nine stepped (n-2)-quads, and then extracting the stepped (n-2) quad we want, we could simply *directly* construct four unstepped (n-1)-quads and step each of those.

That extraction is a bunch of steps but it is certainly possible, and if we did so, we would indeed go from nine steps per recursion to four; this is a good observation.

That said: there is a reason why I described the algorithm the way I did! Tune in next time for the answer.

The change to four Step()s instead of nine has a nice duality with your current implementation – see the gist under my username.

But performance testing shows that the extra work in extracting the right quads

more than compensates for the savings in Step() calls – performance is worse :). Using the nine n-2 quads, the border eight n-1 quads are reused for the neighboring steps, so we have more benefit from the caches.

My inclination would have been to make a step function that takes 9 quads as input, and returns the center quad advanced one step. The limiting case would be the case for 0-quads—the regular rules for Life. Otherwise it would recurse 4 times, calling itself on the four quadrants of the center quad, with the neighbors for each.

I’m not sure whether this would be more efficient, but it seems like it would do less rebuilding of quads, anyway.

That’s a really interesting idea. I see no reason why it would not work in principle. Give it a try if you like and report back the results!

Yes I was also thinking about the “waste” of cells when going from an n-quad to an (n-1)-quad. When calculating a step of an n-quad, in principle you need to “lose” only a single row of cells around the perimeter. When using only the middle (n-1)-quad, you’re left with only a quarter of the number of original cells. The bigger n is, the bigger the difference between those values. It seems like we’re wasting information here that could be exploited at little or no cost — somehow.

Pingback: Life, part 35 | Fabulous adventures in coding