Life, part 36

This was supposed to be the last episode of this series, but while writing it I discovered a bug in my implementation of Hensel’s QuickLife algorithm. It’s a small defect, easily fixed, but I thought it might be interesting to dig into it. (The bug is in the code from episodes 30 through 35; I’ve fixed it in the episode 35 branch.)

When I’m investigating a defect, there are a lot of stages I go through. Today we’ll look at:

  • Make a minimal reproducer
  • Determine the exact code path that causes the defect
  • Understand what went wrong at the level of the code
  • Fix it
  • Do a root cause analysis
  • Sum up the lessons learned

Of course if this were a product there would be more steps in there, such as code review of the fix, writing regression tests, getting the regressions code reviewed, adding assertions that verify the violated invariant elsewhere, reviewing related code for similar defects, determining if customers are affected by the bug or the fix, and so on. Obviously we won’t go through those here!

It took me some time to find a minimal repro; I won’t go through how I did so because it was tedious trial and error; suffice to say that I found a very complicated reproducer by accident and then made it smaller and smaller until I could go no further.

The minimal repro is very straightforward; we load this pattern into an empty QuickLife instance and step it two ticks. There is only one active Quad4. The even generation of the active Quad4 looks like this:

The first step we are going from even generation zero to odd generation one; recall that in QuickLife, the odd generation is offset by one cell in each direction, so odd generation one should look like this:

That’s a still Life, so if we’ve done things correctly, even generation two should look like:

But we have not done things correctly, and it does not look like this! Instead we observe that generation two is displayed as being the same as generation zero and that moreover, this pattern continues on subsequent even generations. We have a little isolated one-cell blinker, which should be impossible.

(ASIDE: This is the simplest version of this defect; while investigating I also found more complex versions where the “spark” part was in an adjacent Quad4 and the “block” part was not a still Life. Going through the more complex scenarios would make this long post much longer but both the root cause and the fix are the same, so I will skip that part of the analysis.)

What code path leads to this defect?

First, a quick refresher. Recall that QuickLife gets its speed in several ways. The core logic that we are concerned about for this bug is:

  • A Quad4 represents two generations of a 16×16 fragment of the board; one even, one odd.
  • Each of the eight Quad3s in a Quad4 keeps track of four “regions” and whether those regions are active, stable or dead. By “stable” we mean “exactly the same as two ticks previous”, and by “dead” we mean “stable, and every cell is dead”.
  • If a Quad3 in an active Quad4 is stable or dead then we can skip computing its next generation because it will be the same as the previous generation.

The code I wrote to implement this earlier in the series is usually correct, but it is subtly wrong on generations zero and one, which of course can then cause the computation to be wrong on every subsequent generation. Let’s trace through it carefully and see where things go pear shaped.

Initial state:

  • The even half of the Quad4 cell state is as given in the first diagram.
  • The odd half of the Quad4 cell state is all dead.
  • All even and odd regions of all Quad3s are set to “active”.

To step from generation zero to generation one, remember we execute this code:

private void StepEven()
  foreach (Quad4 c in active)
    if (!RemoveStableEvenQuad4(c))
    c.StayActiveNextStep = false;

The even half of the Quad4 is marked active so we do not remove it, and enter the body of the condition:

public void StepEven()

Things are going to go wrong in the NE Quad3, so let’s take a look at that.

private void StepEvenNE()
  if (OddNEPossiblyActive())
    Quad3 newOddNE = Step9Quad2ToQuad3Even(...);
    OddNEState = oddNE.UpdateOddQuad3State(newOddNE, OddNEState);
    oddNE = newOddNE;
    OddNEState = oddNE.MakeOddStableOrDead(OddNEState);

The first thing we do is check if the odd cycle of the NE Quad3 could possibly change from its current state; since the even cycle is marked as active, it could. We enter the consequence of the condition and correctly compute that the odd NE Quad3 is all dead. There is only one isolated living cell there, and that’s not enough to sustain life.

Here’s the first thing that goes terribly wrong. UpdateOddQuad3State compares the new odd NE Quad3 to the previous one and discovers both of them are “all dead”. Remember our definition of the state bits: in order for a Quad3 region to be marked as “dead” it must be (1) all dead, and (2) stable; all dead twice. We therefore mark the odd NE Quad3 as “dead” in all four regions.

It seems like everything is working correctly so far. I won’t go through the rest of the even-to-odd step in detail; summing up:

  • All four Quad3s on the even cycle are still marked as active in all regions
  • The odd SW Quad3 is marked as active overall and on its east edge, because it is different from how it was in the “previous” generation.
  • The odd NE, NW and SE Quad3s are marked as “dead” in all regions.

Now let’s look at what happens on the generation-one-to-generation-two step, just in the northeast Quad3 of this Quad4:

private void StepOddNE()
  if (EvenNEPossiblyActive())
    Quad3 newEvenNE = Step9Quad2ToQuad3Odd(...);
    EvenNEState = evenNE.UpdateEvenQuad3State(newEvenNE, EvenNEState);
    evenNE = newEvenNE;
    EvenNEState = evenNE.MakeEvenStableOrDead(EvenNEState);

Once again, the first thing we check is whether there is any point in recomputing the even state; if we know it is going to be the same this time as last, then we can skip it… and… wait a minute:

private bool EvenNEPossiblyActive() =>
  OddNortheastOrBorderingActive ||
  N != null && N.OddSouthEdge10EastActive;

We just marked the odd NE, NW and SE Quad3s as “stably dead in all regions” so OddNortheastOrBorderingActive is false; there is no Quad4 to the north, but if there were, it would be all dead also. The wrong conclusion we reach is: on the next tick, the northeast corner of this Quad3 must be the same on generation 2 as it was on generation 0 so we can skip computing it.

We therefore enter the alternative (else) branch, call MakeEvenStableOrDead, and mark the even NE Quad3 as “stable”. This is obviously wrong, and worse, that wrongness then persists forever because the whole Quad4 will soon be marked as “stable” and we will have a period-two oscillator between the states illustrated in the first two diagrams above.

The appropriate fix is determined by understanding what has really gone wrong at the level of the code. What invariant did we depend upon that was violated?

If we’re stepping from an even generation to an odd generation, the current generation is stored in the even Quad3s, and the previous generation is stored in the odd Quad3s. But there is an important assumption made by these optimizations: we assume that the current generation was itself computed by stepping the previous generation. The reasoning is:

  • Generation -1 was all dead in the NE Quad3
  • Generation 0 was the result of stepping generation -1 — uh oh
  • Stepping generation 0 gives us generation 1 all dead in the NE Quad3
  • Generations -1 and 1 are the same in the NE Quad3; it is stable
  • Therefore generation 2 is also stable
  • Generation 2 is the same as generation 0 in the NE Quad3 so we can skip computing it and mark it as stable.

The second premise is false, so the conclusion does not follow.

There are a lot of ways to fix this. What we’re going to do here is keep track of one more piece of board-global state: was the current generation actually computed from the previous generation? Or put another way, is the stored cell state of the previous generation correct? The vast majority of the time it will be, but it is not if we have just loaded a pattern in.

If the previous generation is correct then our algorithm is already correct (I hope!) and we do not need to do anything. What should we do if the previous generation is not correct?

The tempting thing to do is to make that flag a parameter to the step functions. Where did things first go wrong? When we marked the odd NE Quad3 as “stably dead”. We could pass a flag in that says “when stepping even to odd, if the previous odd generation is incorrect then do not compare the new odd cells to the previous odd cells; instead mark them all active.” On the next step we would then not skip computing the new even state, and all would be well.

However, there are now eight code paths that we would have to plumb this flag through. An easier, hackier solution — the solution Hensel applied in the original QuickLife source code — is to allow the step to compute the “wrong” stability flags and then fix them later:

private void StepEven()
  foreach (Quad4 c in active)
    if (!RemoveStableEvenQuad4(c))
    c.StayActiveNextStep = false;
    if (!previousCorrect)
  previousCorrect = true;

And similarly on the odd cycle. That’s our fix.

Whenever I fix a defect, I try to spend some time asking myself how this defect came to be in the code in the first place. In this particular case, that’s easy. I remember very clearly the faulty thought process.

As I’ve mentioned before, Hensel’s explanation of the basic ideas underlying the original QuickLife implementation is very clear, but the code that implements it is arcane in many ways. There are loop bodies and conditional bodies with hundreds of lines of dense, bit-twiddling code. The bit manipulation is all raw hex numbers. The variable naming conventions are obscure; p means even, q means odd, for instance. It took me a long time to understand the details of the algorithm, and I wanted to simplify my task by ignoring anything that wasn’t directly related to the actual optimizations.

One of the nice-to-have features of QuickLife (and several other algorithms we’ve looked at) that I did not spend any time in this series addressing is: because we have the previous and current states stored, you could easily add a handy “step backwards one tick” feature to the UI. And in fact the original Java implementation of QuickLife has this feature.

Now, you have to be careful about this, for two reasons. First, obviously once you have stepped backwards once, you cannot do so again. You need to keep track of whether you’ve already stepped back once or not and disallow it. But there is a bigger problem which, given the preceding material in this ever-longer post, you have undoubtedly already deduced. If we were on an even generation, then the current state is in the even state and the previous state is in the odd state. If we step back one tick, then the current state is in the odd state, but the even state is not the previous state; it is the next state. We need to make sure that when we do the odd-to-even step, we do not come to the erroneous conclusion that the even state is all stable!

The original code has a very clearly-named state variable; it has a field backcorrect which indicates whether the previous state is correct or not. If set to false, then the algorithm does exactly what I’ve done in my bug fix: upon stepping, it marks every region of every Quad3 to “active”, and then finally sets the correctness flag to true.

My mistake was that I wrongly believed upon initially reading the code that I could ignore backcorrect because it was only for implementing the step-backwards feature, which I was not going to add to my UI. I completely missed the fact that Hensel sets this flag to false whenever the user forces a change to cell state, and that setting that flag to false is crucial for ensuring the correctness of the algorithm in that scenario.

I feel a keen sense of irony that after figuring out all the obscure and arcane parts of the algorithm, I missed the importance of this state bit that was more clearly named than all the other state bits.

That was the root cause of the defect, but there were other contributory factors:

  • I made the classic high school math mistake of reasoning inductively without establishing a base case. In this blog series I made arguments for the correctness of this algorithm based on the assumption that the previous generation was the real previous generation of the current generation. But that assumption breaks down on the base case of generation zero.
  • My test regimen was inadequate. I have no unit tests or end-to-end tests; all my testing for this entire series has been completely ad-hoc-try-it-out-in-the-UI-and-see-if-it-looks-right.
  • I have very few assertions in the code.
  • When implementing a complex and easily misunderstood algorithm like QuickLife, I probably should have built a “check mode” implementation of the ILife interface that takes two implementations, does all the same operations to both, and verifies that the new board states are the same in both implementations. I could then have written a “random soup” test generator and verified my implementations against each other.

I (re)learned some lessons from this bug. From the specific to the general:

  • When you’re porting code you didn’t write and there is a part you think is unnecessary to port, really dig in and understand whether it can be safely removed.
  • I made good arguments about the correctness of the “steady state”, but I did not establish the correctness of the code on “boundary” conditions such as “it’s the very first step”. Edge cases are not necessarily rare cases; every program starts at an edge case.
  • A lesson I have learned many times in my life as a compiler developer is strongly reinforced by this bug: every time you cache the result of an analysis of a mutable data structure for performance reasons you create an opportunity for a bug should a mutation cause the cached analysis to become incorrect. I’ve fixed so many compiler bugs like that.
  • The whole point of this series is that you can sometimes find specific aspects of the “business domain” of your computation and use those to drive performance optimizations. If those optimizations depend on invariants that can be forced to be violated, the optimization will lead to an incorrect computation!

We took advantage of the rules of Life to get wins; when we force those rules to be violated, computations that depend on those rules become incorrect.

Next time on FAIC: Let’s finish this thing up! Are there patterns that grow quadratically?

9 thoughts on “Life, part 36

  1. Thanks for the write-up! I had been idly wondering through all this (as you’ve been implementing various algorithms) how you’d confirmed that (1) the many-iterations-of-acorn was the meaningful metric to measure, and (2) that it was running correctly. I assumed that acorn was some sort of standard test covering most functionality, but I was curious if there were other “standard” Life-implementation tests that you used (or could use) for correctness and benchmarking. (I’m sure there must be literature out there on it that I’m just not bothering to search for at the moment.)

    • First off you are very welcome; I’m glad you found this long screed interesting.

      I don’t know whether there are standard benchmarks either. I chose many-iterations-of-acorn as my benchmark because:

      * It is an interesting pattern that I was familiar with. Lots of small patterns devolve into still Lifes or low-period oscillators early, and a few like glider guns hit a “steady growth” state early. Acorn is interesting in that it is a tiny pattern that takes thousands of generations to settle down, and grows large compared to its original size. It is not obvious to the newcomer that there are patterns like that.

      * Aside from the half dozen gliders it throws out to infinity, it never exceeds the bounds of a 256 x 256 grid but it fills up a lot of it. This lets me use it as a fair test case for the early algorithms that really are only practical on small boards.

      * There is a clear demarcation between the “chaos” and “stable oscillators and gliders” periods. When I started this series I knew I was going to do Gosper’s HashLife, but I had not yet decided whether to do QuickLife or not. But I knew from my previous experiments with HashLife that this would be an interesting test case precisely because of the sharp bend in the performance curve. Once I had determined to do QuickLife, I knew it would be interesting because of the way QuickLife treats period-two oscillators as stable.

      As for correctness, like I said, I just eyeballed it. Once you’ve seen Life patterns evolve it is pretty easy to detect when something is going seriously wrong. But in this particular case, I really should have done some more principled correctness testing.

  2. Thanks for the analysis. The example reinforces a lot about what’s hard in programming, and what strategies are important.

    I find interesting though that much of this really comes down to this statement above:

    “every time you cache the result of an analysis of a mutable data structure for performance reasons you create an opportunity for a bug should a mutation cause the cached analysis to become incorrect”.

    I still remember (albeit vaguely) one of the earliest times I saw this sentiment. Raymond Chen summed it up in his “Old New Thing” blog with a statement along the lines of “don’t cache what you can compute” (sorry, I’m not able to find the exact reference…but I’m sure it’s out there somewhere).

    We (you included, I presume) came up in programming in a time when PC hardware was in its infancy and was very limited in performance (not that mainframes were a ton better back then, but they scaled a lot better). Computationally interesting programs were unusable without optimizations, and often these took the form of caching. This was especially the case for GUI-related logic (e.g. don’t redraw something unless you absolutely have to). Chen’s admonition came well after these times had passed, and was well-timed to remind us “old timers” that old habits die hard, but are often counter-productive.

    Of course, in your case above, the cache is the whole point. It’d be possible to just recompute the state each frame, but we’d miss out on the very important performance improvements that were the goal in the first place! So the admonition doesn’t apply here. But the example above is still a great reminder, nonetheless. 🙂

    • Yeah, Raymond has a great many pithy expressions which I have made use of over the years. 🙂

      And you are absolutely right; when I first joined the scripting team at Microsoft I made a survey of the VBScript and JScript source code and found over a dozen implementations of caching logic, each one optimized for a different narrow scenario. There was a move-to-front cache for name lookups in JS objects so that if you were calling a method on an object in a loop, the second time you called it you would get the cache hit immediately. That was different than the local variable names cache, which was optimized for a different usage pattern, and so on. People think of the pre-jitted JS engines as slow — and they were — but you also have to remember that they were built to run on 16 bit Windows. There were real tradeoffs to consider between size and speed when you have very little of either available.

      The pre-Roslyn C# compiler was also a mass of caches; not only was there again a great many implementations of basic caching logic, but worse, there were lots of places where a code optimization was computed, cached, optimized a second time, and then there was a cache hit that caused the original optimized form of the code to get mixed up in some structure that expected the second optimization to have been performed. I did an analysis of such a bug here: It was a mess. For Roslyn I rewrote all that code so that, first, as you say, we do not store what we can compute, and second, to greatly reduce the amount of mutable state that could become inconsistent.

  3. Pingback: The Morning Brew - Chris Alcock » The Morning Brew #3085

  4. Maybe this is a good time to mention Golly, an “open source, cross-platform application for exploring Conway’s Game of Life and many other types of cellular automata.” It includes QuickLife and HashLife algorithms. Also, lots of starting patterns are included. One of the more intriguing concepts is the MetaPixel: a square of cells who together implement the usual rules of Life. It is so big and slow that it is only really workable thanks to the HashLife algorithm.

  5. Pingback: Life, part 30 | Fabulous adventures in coding

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s