About ericlippert

http://ericlippert.com

The VSTO startup sequence

Earlier this week I was looking for an old photo, and while browsing I came across a photo I took of my whiteboard in my office at Microsoft in 2004. Or rather, it was two photos; I’ve crudely stitched them together. Click on the image for a larger version.

OMG. What. The. Heck. Is. All. That. Nonsense?

Let me start with a little history.

Before I was on the C# team and after I was on the scripting languages team, I spent a couple years at Microsoft working on the plumbing for Visual Studio Tools for Office.

The idea of VSTO was to bring the ability to write truly rich, client-server aware, data-driven applications in Office documents using C# or Visual Basic; we wanted to go beyond the scripts and productivity tools typical of VBA customizations.

This was a project with many, many difficulties, both technical and political. On the political side of things, I think it is fair to say that the Office team has historically had (with good reason!) a great deal of resistance to taking any compatibility burden that would possibly slow down their ability to innovate in future releases; “platformizing” Word and Excel into true application hosts by a team external to Office was just such a burden.

The technical difficulties were considerable, in large part due to concerns about the security model. We were deeply, painfully aware of how Office customizations and scripting languages had been misused in the past as vectors for malware, and we did not want to create new vulnerabilities. As mitigation, we designed a mechanism that would isolate any customization code to its own appdomain with a highly restrictive default security policy.

Office, however, was not at the time designed to host the CLR. They were only willing to give us a single callback to our loader code that kicked off the whole process when a customized spreadsheet or document was loaded.

By 2004 we were on the fourth revision to our loading algorithm and I was tasked with coming up with the fifth; to facilitate discussion of options I drew a diagram on my whiteboards which I helpfully titled “HIGHLY SIMPLIFIED STARTUP SEQUENCE v4”.

A few things strike me about this diagram now, over 16 years later.


First: though it looks like a mess, I did actually put some thought into the design elements.

  • The diagram is divided into three sections, separated by heavy blue vertical lines. On the left are components running entirely outside of the CLR; in the middle are components that run in the CLR’s default appdomain, and on the right are components that run in the customization’s restricted appdomain. (And of course on the extreme far left is the edge of my “THE MATRIX” poster. A lot of the code names of the different parts of the project were references to The Matrix, including the team cover band that I played keyboards for. I am sad that I can no longer wear my “The Red Pills” polo shirt in public due to the co-opting of that movie reference by misogynist jerks.)
  • The purple boxes that run along the top are components and the lollipops give the interfaces they implement.
  • The purple boxes and arrows below give the exact sequence of twenty different method calls showing what component is calling what other component with what data, and why. In particular the diagram allows us to easily see when a component running in a more restricted security environment is calling into a less restricted environment; those calls need to be allowed because we need them to happen, but that then means that maybe hostile user code could call them, which could be bad.
  • Design problems, questions, annotations and proposed changes are shown in blue.
  • Red is used to identify one key component and an important question about it.
  • I have no idea what that yellow code fragment is or why it was written over top of everything else. It looks irrelevant.

The purpose of the diagram was originally to clarify in my own mind what the sequence was and what the problems were, but by the time it was in this form it was also for giving context to my coworkers when we were discussing options, so it had to be readable. I probably redrew this diagram a half a dozen times before it got to this state.


Second: we can see that there were a good half dozen or more design problems that I was trying to solve here but the big problems involved dirty documents and application manifests.

When you close Word or Excel, you have I am sure noticed that sometimes you get a “save your work?” dialog and sometimes the app just closes. The app is keeping track of whether the document is dirty — changed since it was last loaded or saved — or clean.

Suppose we load a customized spreadsheet, and initializing the customization causes the installer to notice that there is a newer version that it should be downloading. That might change the manifest information about the customization, so the spreadsheet is now “dirty”. But we do not want to ever unnecessarily dirty a document, because that is confusing and irritating to the user.

In step nine the fake application activator obtains an IAppInfo reference from the appdomain manager, updates the manifest from the customization’s server, and parses the manifest. My comments say:

  • Do not write back at this point; need to maintain dirty state
  • No, don’t do this at all. Host must provide updated manifest. This is not a VSTA feature, it is VSTO. (Meaning here that something here is unique to Visual Studio Tools for Office, and not the generalization of it we were working on, VST for Applications.)
  • Must do both. Don’t write back. AIState object must ensure dirtyness.

Apparently I was deeply conflicted on this point. I don’t recall how it was resolved.

My favourite comment though is the one in red:

Can we take manifest out of doc? Peter: “It would be awesome. If assembly is available offline, so is manifest”.

The scenario here had something to do with both the dirty bit problem, and more generally dealing with locally cached customizations. We did a bunch of work on the security model for “what happens if you’re trying to run a customization while your laptop is in airplane mode and we can’t get to the server to check for updates”. Peter is of course legendary Microsoft PM Peter Torr with whom I worked for many years.

My second favourite was where I said “OFFICE12?” Yeah, what’s going to happen when Office revs? Can we guarantee that all this stuff keeps working?


Third: It’s funny how the mind works. Though I’ve described the organization of the diagram and the major problems, today I remember almost none of what is going on here, what the specific issues were, or how we resolved them. But that whole sequence was intensely important to me for several months of my life; it was the foundational plumbing to the entire endeavor and so I knew it literally forwards and backwards. Those memories are 90% gone now. And yet if someone were to yell the numbers “nine six seven eleven eleven” at me from across the street I would be unable to not think “call Pizza Pizza, right away“. Thanks, 1980s jingle writers.


Fourth: I often think about this sort of thing in the context of those “tell me about a time you solved a design problem” interview questions. This “highly simplified” startup sequence with its twenty method calls has to balance:

  • security
  • performance
  • privacy
  • debuggability
  • code maintainability
  • versioning
  • robustness
  • graceful failure
  • user irritation

and numerous other design criteria. But can you imagine trying to explain any detail of this diagram to someone with no prior knowledge in a 45 minute interview? Real-world design problems are hard precisely because there are so many conflicting goals and messy politics. And worse, too often this is the institutional knowledge that is never written down and then lost.


Coming up on FAIC: Not sure!

  • I want to embark upon a more detailed dive into Bean Machine
  • We have just open-sourced a tool we use for benchmarking PPLs internally; I’d like to talk about that a bit
  • I’ve developed a little AST rewriting library in Python that is kinda fun; I could delve into the ideas behind that.

Let me know in the comments what you think.

Life, part 37

All right, let’s finish this thing off and finally answer the question that I’ve been asking for a while: are there any Life patterns that have unbounded quadratic growth? (Code for this final episode is here.)

The answer is yes; there are several different kinds of patterns that exhibit quadratic growth. Today we’ll look at one of them.

We know that guns, puffers and rakes all produce linear growth. But if we had a puffer whose debris was glider guns, that thing would have linear growth of glider guns, each of which has linear growth of gliders, and so that would be quadratic growth! Such a pattern is called a “breeder”.


ASIDE: Quadratic growth is the upper limit; do you see why? Making an argument for why no finite Life pattern can grow the total number of living cells more than quadratically is left as an exercise for the reader.


Let’s see if we can make this happen.

Back in episode 22 I posted a puffer/rake that produces loafs, blocks and two streams of gliders, and said it was a simplified version of “backrake 3”.

There is rather a lot we can do to modify this pattern to change the stuff left in its wake. For example, if we put another pair of spaceships slightly behind the rake, the little “spark” that the spaceship makes interacts with the loaves-and-blocks debris trail but not the gliders. The resulting chaos has the effect of destroying every other block, and all the loaves:

If we put a second pair of spaceships even further behind, they clean up the remaining blocks. This is the pattern “backrake 3”:

What’s more, if we leave out one of those trailing spaceships, that’s fine. We end up destroying the blocks on just one side and the other block remains.

Rounding out our collection of modifications, we can also put a spaceship off to either side such that when a glider collides with it, the glider is destroyed but the spaceship lives:

Summing up: by modifying our original puffer/rake we can produce any combination of:

  • Gliders above, below, or neither
  • Blocks and loaves, pairs of blocks, single blocks, or nothing.

That’s interesting, but how does it get us to a quadratic-growth pattern? Patience! We will get there.

The next thing to consider is: we’ve seen that we can turn blocks-and-loaves into no loaves and half as many blocks. And we’ve seen that we can go from gliders to no gliders on either side. Could we make the glider stream half as dense? For example, could we destroy every other glider? Yes, by combining two of our patterns into a larger one. We’ll combine:

  • Single block, single glider
  • No still Lifes debris, single glider

There is a way to cause a glider to hit a block such that both of them are destroyed, but the single blocks left behind by the puffer are spaced such that every other glider survives! Here the upper puffer is producing its usual spaceships and blocks, but the lower puffer has half of its gliders destroyed:

OK, so we can thin out a stream of gliders; that is another tool in our toolbox.

The next thing to know is that it is possible to create a huge number of interesting Life patterns by running gliders into each other; in fact, finding the “glider synthesis” of a pattern is a major area of focus for Life enthusiasts. For example, if you collide two gliders together like this:

Then six generations later, there will be a “pond” still Life:

If you then hit that pond with a second glider like this:

Then four generations later it becomes a ship still Life:

So we can create ponds with two gliders and ships with three. If we then hit that ship with a glider like this:

Then it turns into the unstable “queen bee” pattern:

But we know from episode 17 that two queen bees stabilized by two blocks makes a glider gun! We have rakes that produce streams of gliders, we have puffers that produce streams of blocks; put it all together and we get:

Breeder 1! Click on the image for a larger picture, load it into the executable for this episode, or take a look at it running on the wiki; it is quite hypnotic to watch all its intricate parts work together. With this pattern we add a new glider gun every 64 ticks, and then each glider gun produces a new glider every 30 ticks. It is obvious just from looking at the triangle of gliders that the population of gliders is growing as the square; it is making half a square.

It should come as no surprise that this pattern was created by Bill Gosper in the 1970s; just as he was the first to build a glider gun, so too was he the first to build a breeder.

What is the performance of our implementation of Hensel’s QuickLife compared to Gosper’s HashLife? Let’s start by thinking about what we should expect with QuickLife.  The time cost of one tick in QuickLife is proportional to the number of active Quad3s, and therefore the cost of computing tick number n should be proportional to n2. That means the total cost of computing ticks one through n should be proportional to n3.

I ran a similar test to the performance test from episode 35: I ran 8 generations, then restarted the engine and ran 16 generations, then restarted and ran 32 generations, and so on, up to 214 generations. This log-log scale chart is ticks on the x axis, milliseconds on the y axis:

Fascinating! We see that in the early part of the growth, we are basically linear; doubling the number of ticks doubles the amount of time it takes to compute. But by the time we are up to computing the first 8192 ticks or more, the total time is now growing as the cube, as we would expect. (Remember, the slope of the line on the log-log graph indicates the exponent of the geometric growth.)

What about Gosper’s HashLife algorithm? Same test: start from the original breeder pattern, step forward some number of ticks. Of course we reset the caches after every test so as to fairly assign work done by a test to that test.

The astonishing result: 76 milliseconds, no matter how many generations we are stepping ahead! Notice that I’m not saying 76 milliseconds per generation; 76 milliseconds, period. Want to see what the pattern does after 214 generations? QuickLife took almost three minutes to compute that; HashLife does it in 76 milliseconds. 224 generations? 234? Same cost! It looks like magic.

What is happening here is of course Gosper’s algorithm is exploiting the extraordinary amount of similarity in both space and time. The spaceship part of the breeder moves itself 32 cells and repeats its internal configuration every 64 ticks which is perfect for HashLife’s quad-based memoizing. The glider guns are all the same, and the half-square of gliders is easily deduplicated no matter how big it gets.


Let’s wrap it up here; I hope you enjoyed that exploration of Life algorithms as much as I did; I have been wanting to write this series for over a decade.

There is enormously more to Life; I feel like I have just scratched the surface in this series. There are many more patterns to explore and open problems to solve. And more generally there is active and interesting work being done by enthusiasts on cellular automata with different rules, different dimensions, different grid connectivity, and so on.

If you’re interested in learning more, I highly recommend two things; first, the latest news in Life discoveries is available on the wiki that I have linked to many times in this series. Second, try playing around with Golly, a much more fully-featured and high-performance engine for exploring cellular automata than the little toy I’ve made for pedagogic purposes in this series. It implements both Gosper’s HashLife and Hensel’s QuickLife, along with variants for exploring other rule sets.


Coming up on FAIC: I’m going to take a short break from blogging for a bit; that last series was rather longer than I anticipated! We will probably then continue looking at Bean Machine, the probabilistic programming language I work on these days.

Life, part 36

This was supposed to be the last episode of this series, but while writing it I discovered a bug in my implementation of Hensel’s QuickLife algorithm. It’s a small defect, easily fixed, but I thought it might be interesting to dig into it. (The bug is in the code from episodes 30 through 35; I’ve fixed it in the episode 35 branch.)

When I’m investigating a defect, there are a lot of stages I go through. Today we’ll look at:

  • Make a minimal reproducer
  • Determine the exact code path that causes the defect
  • Understand what went wrong at the level of the code
  • Fix it
  • Do a root cause analysis
  • Sum up the lessons learned

Of course if this were a product there would be more steps in there, such as code review of the fix, writing regression tests, getting the regressions code reviewed, adding assertions that verify the violated invariant elsewhere, reviewing related code for similar defects, determining if customers are affected by the bug or the fix, and so on. Obviously we won’t go through those here!


It took me some time to find a minimal repro; I won’t go through how I did so because it was tedious trial and error; suffice to say that I found a very complicated reproducer by accident and then made it smaller and smaller until I could go no further.

The minimal repro is very straightforward; we load this pattern into an empty QuickLife instance and step it two ticks. There is only one active Quad4. The even generation of the active Quad4 looks like this:

The first step we are going from even generation zero to odd generation one; recall that in QuickLife, the odd generation is offset by one cell in each direction, so odd generation one should look like this:

That’s a still Life, so if we’ve done things correctly, even generation two should look like:

But we have not done things correctly, and it does not look like this! Instead we observe that generation two is displayed as being the same as generation zero and that moreover, this pattern continues on subsequent even generations. We have a little isolated one-cell blinker, which should be impossible.

(ASIDE: This is the simplest version of this defect; while investigating I also found more complex versions where the “spark” part was in an adjacent Quad4 and the “block” part was not a still Life. Going through the more complex scenarios would make this long post much longer but both the root cause and the fix are the same, so I will skip that part of the analysis.)


What code path leads to this defect?

First, a quick refresher. Recall that QuickLife gets its speed in several ways. The core logic that we are concerned about for this bug is:

  • A Quad4 represents two generations of a 16×16 fragment of the board; one even, one odd.
  • Each of the eight Quad3s in a Quad4 keeps track of four “regions” and whether those regions are active, stable or dead. By “stable” we mean “exactly the same as two ticks previous”, and by “dead” we mean “stable, and every cell is dead”.
  • If a Quad3 in an active Quad4 is stable or dead then we can skip computing its next generation because it will be the same as the previous generation.

The code I wrote to implement this earlier in the series is usually correct, but it is subtly wrong on generations zero and one, which of course can then cause the computation to be wrong on every subsequent generation. Let’s trace through it carefully and see where things go pear shaped.

Initial state:

  • The even half of the Quad4 cell state is as given in the first diagram.
  • The odd half of the Quad4 cell state is all dead.
  • All even and odd regions of all Quad3s are set to “active”.

To step from generation zero to generation one, remember we execute this code:

private void StepEven()
{
  foreach (Quad4 c in active)
  {
    if (!RemoveStableEvenQuad4(c))
    {
      c.StepEven();
      MakeOddNeighborsActive(c);
    }
    c.StayActiveNextStep = false;
  }
}

The even half of the Quad4 is marked active so we do not remove it, and enter the body of the condition:

public void StepEven()
{
  StepEvenNW();
  StepEvenSW();
  StepEvenNE();
  StepEvenSE();
}

Things are going to go wrong in the NE Quad3, so let’s take a look at that.

private void StepEvenNE()
{
  if (OddNEPossiblyActive())
  {
    Quad3 newOddNE = Step9Quad2ToQuad3Even(...);
    OddNEState = oddNE.UpdateOddQuad3State(newOddNE, OddNEState);
    oddNE = newOddNE;
  }
  else 
    OddNEState = oddNE.MakeOddStableOrDead(OddNEState);
}

The first thing we do is check if the odd cycle of the NE Quad3 could possibly change from its current state; since the even cycle is marked as active, it could. We enter the consequence of the condition and correctly compute that the odd NE Quad3 is all dead. There is only one isolated living cell there, and that’s not enough to sustain life.

Here’s the first thing that goes terribly wrong. UpdateOddQuad3State compares the new odd NE Quad3 to the previous one and discovers both of them are “all dead”. Remember our definition of the state bits: in order for a Quad3 region to be marked as “dead” it must be (1) all dead, and (2) stable; all dead twice. We therefore mark the odd NE Quad3 as “dead” in all four regions.

It seems like everything is working correctly so far. I won’t go through the rest of the even-to-odd step in detail; summing up:

  • All four Quad3s on the even cycle are still marked as active in all regions
  • The odd SW Quad3 is marked as active overall and on its east edge, because it is different from how it was in the “previous” generation.
  • The odd NE, NW and SE Quad3s are marked as “dead” in all regions.

Now let’s look at what happens on the generation-one-to-generation-two step, just in the northeast Quad3 of this Quad4:

private void StepOddNE()
{
  if (EvenNEPossiblyActive())
  {
    Quad3 newEvenNE = Step9Quad2ToQuad3Odd(...);
    EvenNEState = evenNE.UpdateEvenQuad3State(newEvenNE, EvenNEState);
    evenNE = newEvenNE;
  }
  else
    EvenNEState = evenNE.MakeEvenStableOrDead(EvenNEState);
}

Once again, the first thing we check is whether there is any point in recomputing the even state; if we know it is going to be the same this time as last, then we can skip it… and… wait a minute:

private bool EvenNEPossiblyActive() =>
  OddNortheastOrBorderingActive ||
  N != null && N.OddSouthEdge10EastActive;

We just marked the odd NE, NW and SE Quad3s as “stably dead in all regions” so OddNortheastOrBorderingActive is false; there is no Quad4 to the north, but if there were, it would be all dead also. The wrong conclusion we reach is: on the next tick, the northeast corner of this Quad3 must be the same on generation 2 as it was on generation 0 so we can skip computing it.

We therefore enter the alternative (else) branch, call MakeEvenStableOrDead, and mark the even NE Quad3 as “stable”. This is obviously wrong, and worse, that wrongness then persists forever because the whole Quad4 will soon be marked as “stable” and we will have a period-two oscillator between the states illustrated in the first two diagrams above.


The appropriate fix is determined by understanding what has really gone wrong at the level of the code. What invariant did we depend upon that was violated?

If we’re stepping from an even generation to an odd generation, the current generation is stored in the even Quad3s, and the previous generation is stored in the odd Quad3s. But there is an important assumption made by these optimizations: we assume that the current generation was itself computed by stepping the previous generation. The reasoning is:

  • Generation -1 was all dead in the NE Quad3
  • Generation 0 was the result of stepping generation -1 — uh oh
  • Stepping generation 0 gives us generation 1 all dead in the NE Quad3
  • Generations -1 and 1 are the same in the NE Quad3; it is stable
  • Therefore generation 2 is also stable
  • Generation 2 is the same as generation 0 in the NE Quad3 so we can skip computing it and mark it as stable.

The second premise is false, so the conclusion does not follow.

There are a lot of ways to fix this. What we’re going to do here is keep track of one more piece of board-global state: was the current generation actually computed from the previous generation? Or put another way, is the stored cell state of the previous generation correct? The vast majority of the time it will be, but it is not if we have just loaded a pattern in.

If the previous generation is correct then our algorithm is already correct (I hope!) and we do not need to do anything. What should we do if the previous generation is not correct?

The tempting thing to do is to make that flag a parameter to the step functions. Where did things first go wrong? When we marked the odd NE Quad3 as “stably dead”. We could pass a flag in that says “when stepping even to odd, if the previous odd generation is incorrect then do not compare the new odd cells to the previous odd cells; instead mark them all active.” On the next step we would then not skip computing the new even state, and all would be well.

However, there are now eight code paths that we would have to plumb this flag through. An easier, hackier solution — the solution Hensel applied in the original QuickLife source code — is to allow the step to compute the “wrong” stability flags and then fix them later:

private void StepEven()
{
  foreach (Quad4 c in active)
  {
    if (!RemoveStableEvenQuad4(c))
    {
      c.StepEven();
      MakeOddNeighborsActive(c);
    }
    c.StayActiveNextStep = false;
    if (!previousCorrect)
      c.SetOddQuad4AllRegionsActive();
  }
  previousCorrect = true;
}

And similarly on the odd cycle. That’s our fix.


Whenever I fix a defect, I try to spend some time asking myself how this defect came to be in the code in the first place. In this particular case, that’s easy. I remember very clearly the faulty thought process.

As I’ve mentioned before, Hensel’s explanation of the basic ideas underlying the original QuickLife implementation is very clear, but the code that implements it is arcane in many ways. There are loop bodies and conditional bodies with hundreds of lines of dense, bit-twiddling code. The bit manipulation is all raw hex numbers. The variable naming conventions are obscure; p means even, q means odd, for instance. It took me a long time to understand the details of the algorithm, and I wanted to simplify my task by ignoring anything that wasn’t directly related to the actual optimizations.

One of the nice-to-have features of QuickLife (and several other algorithms we’ve looked at) that I did not spend any time in this series addressing is: because we have the previous and current states stored, you could easily add a handy “step backwards one tick” feature to the UI. And in fact the original Java implementation of QuickLife has this feature.

Now, you have to be careful about this, for two reasons. First, obviously once you have stepped backwards once, you cannot do so again. You need to keep track of whether you’ve already stepped back once or not and disallow it. But there is a bigger problem which, given the preceding material in this ever-longer post, you have undoubtedly already deduced. If we were on an even generation, then the current state is in the even state and the previous state is in the odd state. If we step back one tick, then the current state is in the odd state, but the even state is not the previous state; it is the next state. We need to make sure that when we do the odd-to-even step, we do not come to the erroneous conclusion that the even state is all stable!

The original code has a very clearly-named state variable; it has a field backcorrect which indicates whether the previous state is correct or not. If set to false, then the algorithm does exactly what I’ve done in my bug fix: upon stepping, it marks every region of every Quad3 to “active”, and then finally sets the correctness flag to true.

My mistake was that I wrongly believed upon initially reading the code that I could ignore backcorrect because it was only for implementing the step-backwards feature, which I was not going to add to my UI. I completely missed the fact that Hensel sets this flag to false whenever the user forces a change to cell state, and that setting that flag to false is crucial for ensuring the correctness of the algorithm in that scenario.

I feel a keen sense of irony that after figuring out all the obscure and arcane parts of the algorithm, I missed the importance of this state bit that was more clearly named than all the other state bits.

That was the root cause of the defect, but there were other contributory factors:

  • I made the classic high school math mistake of reasoning inductively without establishing a base case. In this blog series I made arguments for the correctness of this algorithm based on the assumption that the previous generation was the real previous generation of the current generation. But that assumption breaks down on the base case of generation zero.
  • My test regimen was inadequate. I have no unit tests or end-to-end tests; all my testing for this entire series has been completely ad-hoc-try-it-out-in-the-UI-and-see-if-it-looks-right.
  • I have very few assertions in the code.
  • When implementing a complex and easily misunderstood algorithm like QuickLife, I probably should have built a “check mode” implementation of the ILife interface that takes two implementations, does all the same operations to both, and verifies that the new board states are the same in both implementations. I could then have written a “random soup” test generator and verified my implementations against each other.

I (re)learned some lessons from this bug. From the specific to the general:

  • When you’re porting code you didn’t write and there is a part you think is unnecessary to port, really dig in and understand whether it can be safely removed.
  • I made good arguments about the correctness of the “steady state”, but I did not establish the correctness of the code on “boundary” conditions such as “it’s the very first step”. Edge cases are not necessarily rare cases; every program starts at an edge case.
  • A lesson I have learned many times in my life as a compiler developer is strongly reinforced by this bug: every time you cache the result of an analysis of a mutable data structure for performance reasons you create an opportunity for a bug should a mutation cause the cached analysis to become incorrect. I’ve fixed so many compiler bugs like that.
  • The whole point of this series is that you can sometimes find specific aspects of the “business domain” of your computation and use those to drive performance optimizations. If those optimizations depend on invariants that can be forced to be violated, the optimization will lead to an incorrect computation!

We took advantage of the rules of Life to get wins; when we force those rules to be violated, computations that depend on those rules become incorrect.


Next time on FAIC: Let’s finish this thing up! Are there patterns that grow quadratically?

Introducing Bean Machine

The final part of my Life series is still in the works but I need to interrupt that series with some exciting news. The new programming language I have been working on for the last year or so has just been announced by the publication of our paper Bean Machine: A Declarative Probabilistic Programming Language For Efficient Programmable Inference


Before I get into the details, a few notes on attributing credit where it is due and the like:

  • Though my name appears on the paper as a courtesy, I did not write this paper. Thanks and congratulations in particular to Naz Tehrani and Nim Arora who did a huge amount of work getting this paper together.
  • The actual piece of the language infrastructure that I work on every day is a research project involving extraction, type analysis and optimization of the Bayesian network underlying a Bean Machine program. We have not yet announced the details of that project, but I hope to be able to discuss it here soon.
  • Right now we’ve only got the paper; more information about the language and how to take it out for a spin yourself will come later. It will ship when its ready, and that’s all the scheduling information I’ve got.
  • The name of the language comes from a physical device for visualizing probability distributions because that’s what it does.


I will likely do a whole series on Bean Machine later on this autumn, but for today let me just give you the brief overview should you not want to go through the paper. As the paper’s title says, Bean Machine is a Probabilistic Programming Language (PPL).

For a detailed introduction to PPLs you should read my “Fixing Random” series, where I show how we could greatly improve support for analysis of randomness in .NET by both adding types to the base class library and by adding language features to a language like C#.

If you don’t want to read that 40+ post introduction, here’s the TLDR.

We are all used to two basic kinds of programming: produce an effect and compute a result. The important thing to understand is that Bean Machine is firmly in the “compute a result” camp. In our PPL the goal of the programmer is to declaratively describe a model of how the world works, then input some observations of the real world in the context of the model, and have the program produce posterior distributions of what the real world is probably like, given those observations. It is a language for writing statistical model simulations.

A “hello world” example will probably help. Let’s revisit a scenario I first discussed in part 30 of Fixing Random: flipping a coin that comes from an unfair mint. That is, when you flip a coin from this mint, you do not necessarily have a 50-50 chance of getting heads vs tails. However, we do know that when we mint a coin, the distribution of fairness looks like this:

Fairness is along the x axis; 0.0 means “always tails”, 1.0 means “always heads”. The probability of getting a coin of a particular fairness is proportional to the area under the graph. In the graph above I highlighted the area between 0.6 and 0.8; the blue area is about 25% of the total area under the curve, so we have a 25% chance that a coin will be between 0.6 and 0.8 fair.

Similarly, the area between 0.4 and 0.6 is about 30% of the total area, so we have a 30% chance of getting a coin whose fairness is between 0.4 and 0.6. You see how this goes I’m sure.

Suppose we mint a coin; we do not know its true fairness, just the distribution of fairness above. We flip the coin 100 times, and we get 72 heads, 28 tails. What is the most probable fairness of the coin?

Well, obviously the most probable fairness of a coin that comes up heads 72 times out of 100 is 0.72, right?

Well, no, not necessarily right. Why? Because the prior probability that we got a coin that is between 0.0 and 0.6 is rather a lot higher than the prior probability that we got a coin between 0.6 and 1.0. It is possible by sheer luck to get 72 heads out of 100 with a coin between 0.0 and 0.6 fairness, and those coins are more likely overall.


Aside: If that is not clear, try thinking about an easier problem that I discussed in my earlier series. You have 999 fair coins and one double-headed coin. You pick a coin at random, flip it ten times and get ten heads in a row. What is the most likely fairness, 0.5 or 1.0? Put another way: what is the probability that you got the double-headed coin? Obviously it is not 0.1%, the prior, but nor is it 100%; you could have gotten ten heads in a row just by luck with a fair coin. What is the true posterior probability of having chosen the double-headed coin given these observations?


What we have to do here is balance between two competing facts. First, the fact that we’ve observed some coin flips that are most consistent with 0.72 fairness, and second, the fact that the coin could easily have a smaller (or larger!) fairness and we just got 72 heads by luck. The math to do that balancing act to work out the true distribution of possible fairness is by no means obvious.

What we want to do is use a PPL like Bean Machine to answer this question for us, so let’s build a model!

The code will probably look very familiar, and that’s because Bean Machine is a declarative language based on Python; all Bean Machine programs are also legal Python programs. We begin by saying what our “random variables” are.


Aside: Statisticians use “variable” in a way very different than computer programmers, so do not be fooled here by your intuition. By “random variable” we mean that we have a distribution of possible random values; a representation of any single one of those values drawn from a distribution is a “random variable”. 


To represent random variables we declare a function that returns a pytorch distribution object for the distribution from which the random variable has been drawn. The curve above is represented by the function beta(2, 2), and we have a constructor for an object that represents that distribution in the pytorch library that we’re using, so:

@random_variable
def coin():
  return Beta(2.0, 2.0)

Easy as that. Every usage in the program of coin() is logically a single random variable; that random variable is a coin fairness that was generated by sampling it from the beta(2, 2) distribution graphed above.


Aside: The code might seem a little weird, but remember we do these sorts of shenanigans all the time in C#. In C# we might have a method that looks like it returns an int, but the return type is Task<int>; we might have a method that yield returns a double, but the return type is IEnumerable<double>. This is very similar; the method looks like it is returning a distribution of fairnesses, but logically we treat it like a specific fairness drawn from that distribution.


What do we then do? We flip a coin 100 times. We therefore need a random variable for each of those coin flips:

@random_variable
def flip(i):
  return Bernoulli(coin())

Let’s break that down. Each call flip(0), flip(1), and so on on, are distinct random variables; they are outcomes of a Bernoulli process — the “flip a coin” process — where the fairness of the coin is given by the single random variable coin(). But every call to flip(0) is logically the same specific coin flip, no matter how many times it appears in the program.

For the purposes of this exercise I generated a coin and simulated 100 coin tosses to simulate our observations of the real world. I got 72 heads. Because I can peek behind the curtain for the purposes of this test, I can tell you that the coin’s true fairness was 0.75, but of course in a real-world scenario we would not know that. (And of course it is perfectly plausible to get 72 heads on 100 coin flips with a 0.75 fair coin.)

We need to say what our observations are.  The Bernoulli distribution in pytorch produces a 1.0 tensor for “heads” and a 0.0 tensor for “tails”. Our observations are represented as a dictionary mapping from random variables to observed values.

heads = tensor(1.0)
tails = tensor(0.0)
observations = {
  flip(0) : heads,
  flip(1) : tails,
  ...  and so on, 100 times with 72 heads, 28 tails.
}

Finally, we have to tell Bean Machine what to infer. We want to know the posterior probability of fairness of the coin, so we make a list of the random variables we care to infer posteriors on; there is only one in this case.

inferences = [ coin() ]
posteriors = infer(observations, inferences)
fairness = posteriors[coin()]

and we get an object representing samples from the posterior fairness of the coin given these observations. (I’ve simplified the call site to the inference method slightly here for clarity; it takes more arguments to control the details of the inference process.)

The “fairness” object that is handed back is the result of efficiently simulating the possible worlds that get you to the observed heads and tails; we then have methods that allow you to graph the results of those simulations using standard graphing packages:

The orange marker is our original guess of observed fairness: 0.72. The red marker is the actual fairness of the coin used to generate the observations, 0.75. The blue histogram shows the results of 1000 simulations; the vast majority of simulations that produced those 72 heads had a fairness between 0.6 and 0.8, even though only 25% of the coins produced by the mint are in that range.  As we would hope, both the orange and red markers are near the peak of the histogram.

So yes, 0.72 is close to the most likely fairness, but we also see here that a great many other fairnesses are possible, and moreover, we clearly see how likely they are compared to 0.72. For example, 0.65 is also pretty likely, and it is much more likely than, say, 0.85. This should make sense, since the prior distribution was that fairnesses closer to 0.5 are more likely than those farther away; there’s more “bulk” to the histogram to the left than the right: that is the influence of the prior on the posterior!

Of course because we only did 1000 simulations there is some noise; if we did more simulations we would get a smoother result and a clear, single peak. But this is a pretty good estimate for a Python program with six lines of model code that only takes a few seconds to run.


Why do we care about coin flips? Obviously we don’t care about solving coin flip problems for their own sake. Rather, there are a huge number of real-world problems that can be modeled as coin flips where the “mint” produces unfair coins and we know the distribution of coins that come from that mint:

  • A factory produces routers that have some “reliability”; each packet that passes through each router in a network “flips a coin” with that reliability; heads, the packet gets delivered correctly, tails it does not. Given some observations from a real data center, which is the router that is most likely to be the broken one? I described this model in my Fixing Random series.
  • A human reviewer classifies photos as either “a funny cat picture” or “not a funny cat picture”. We have a source of photos — our “mint” — that produces pictures with some probability of them being a funny cat photo, and we have human reviewers each with some individual probability of making a mistake in classification. Given a photo and ten classifications from ten reviewers, what is the probability that it is a funny cat photo? Again, each of these actions can be modeled as a coin flip.
  • A new user is either a real person or a hostile robot, with some probability. The new user sends a friend request to you; you either accept it or reject it based on your personal likelihood of accepting friend requests. Each one of these actions can be modeled as a coin flip; given some observations of all those “flips”, what is the posterior probability that the account is a hostile robot?

And so on; there are a huge number of real-world problems we can solve just with modeling coin flips, and Bean Machine does a lot more than just coin flip models!


I know that was rather a lot to absorb, but it is not every day you get a whole new programming language to explain! In future episodes I’ll talk more about how Bean Machine works behind the scenes, how we traded off between declarative and imperative style, and that sort of thing. It’s been a fascinating journey so far and I can’t hardly wait to share it.

 

Life, part 35

Last time we implemented what looked like Gosper’s algorithm and got a disappointing result; though the quadtree data structure is elegant and the recursive algorithm is simple, and even though we memoize every operation, the time performance is on par with our original naive implementation, and the amount of space consumed by the memoizers is ginormous. But as I said last time, I missed a trick in my description of the algorithm, and that trick is the key to the whole thing. (Code for this episode is here.)

One reader pointed out that we could be doing a better job with the caching. Sure, that is absolutely true. There are lots of ways we could come up with a better cache mechanism than my hastily-constructed dictionary, and those would in fact lead to marginal performance gains. But I was looking for a win in the algorithm itself, not in the details of the cache.

A few readers made the astute observation that the number of recursions — nine — was higher than necessary. The algorithm I gave was:

  • We are given an n-quad and wish to step the center (n-1)-quad.
  • We make nine unstepped (n-1)-quads and step each of them to get nine stepped (n-2)-quads
  • We reform those nine (n-2)-quads into four stepped (n-1)-quads, take the centers of each, and that’s our stepped (n-1) quad.

But we have all the information we need in the original n-quad to extract four unstepped (n-1)-quads. We then could step each of those to get four center stepped (n-2)-quads, and we can reform those into the desired (n-1)-quad.

Extracting those four unstepped (n-1)-quads is a fair amount of work, but there is an argument to be made that it might be worth the extra work in order to go from nine recursions to four. I didn’t try it, but a reader did and reports back that it turns out this is not a performance win. Regardless though, this wasn’t the win I was looking for.

Let’s go through the derivation one more time, and derive Gosper’s algorithm for real.

We still have our base case: we can take any 2-quad and get the center 1-quad stepped one tick forward. Suppose once again we are trying to step the outer green 3-quad forward; we step each of its component green 2-quads forward one tick to get these four blue 1-quads:

We then extract the north, south, east, west and center 2-quads from the 3-quad and step each of those forwards one tick, and that gives us these nine blue 1-quads, each one step in the future:

 

We then form four 2-quads from those nine 1-quads; here we are looking at the northwest 2-quad and its center 1-quad:

The light blue 2-quad and its dark blue 1-quad center are both one tick ahead of the outer green 3-quad. This is where we missed our trick.

We have the light blue 2-quad, and it is one tick ahead of the green 3-quad. We want to get its center 1-quad. What if we got its center 1-quad stepped one tick ahead? We know we can do it! It’s a 2-quad and we can get the center 1-quad of any 2-quad stepped one tick ahead. We can make the innermost dark blue quad stepped two ticks ahead. We repeat that operation four times and we have enough information to construct…

…the center 2-quad stepped two ticks ahead, not one.

Now let’s do the same reasoning for a 4-quad.

We step its nine component 3-quads forwards two ticks, because as we just saw, we can do that for a 3-quad. We then compose those nine 2-quads into four 3-quads, step each of those forward two ticks, again because we can, and construct the center 3-quad stepped four ticks ahead.

And now let’s do the same reasoning for an n-quad… you see where this is going I’m sure.

This is the astonishing power of Gosper’s algorithm. Given an n-quad, we can step forward its center (n-1)-quad by 2n-2 ticks for any n>=2.

Want to know the state of the board a million ticks in the future? Embiggen the board until it is a 22-quad — we know that operation is cheap and easy — and you can get the center 21-quad stepped forwards by 220 ticks using this algorithm. A billion ticks? Embiggen it to a 32-quad, step it forward 230 ticks.

We showed last time an algorithm for stepping an n-quad forward by one tick; here we’ve sketched an algorithm for stepping an n-quad forward by 2n-2 ticks. What would be really nice from a user-interface perspective is if we had a hybrid algorithm that can step an n-quad forward by 2k ticks for any k between 0 and n-2.

You may recall that many episodes ago I added an exponential “speed factor” where the factor is the log2 of the number of ticks to step. We can now write an implementation of Gosper’s algorithm for real this time that takes a speed factor. Rather than try to explain it further, let’s just look at the code.

private static Quad UnmemoizedStep((Quad q, int speed) args)
{
  Quad q = args.q;
  int speed = args.speed;

  Debug.Assert(q.Level >= 2);
  Debug.Assert(speed >= 0);
  Debug.Assert(speed <= q.Level - 2);

  Quad r;
  if (q.IsEmpty)
    r = Quad.Empty(q.Level - 1);
  else if (speed == 0 && q.Level == 2)
    r = StepBaseCase(q);
  else
  {
    // The recursion requires that the new speed be not
    // greater than the new level minus two. Decrease speed
    // only if necessary.
    int nineSpeed = (speed == q.Level - 2) ? speed - 1 : speed;
    Quad q9nw = Step(q.NW, nineSpeed);
    Quad q9n = Step(q.N, nineSpeed);
    Quad q9ne = Step(q.NE, nineSpeed);
    Quad q9w = Step(q.W, nineSpeed);
    Quad q9c = Step(q.Center, nineSpeed);
    Quad q9e = Step(q.E, nineSpeed);
    Quad q9sw = Step(q.SW, nineSpeed);
    Quad q9s = Step(q.S, nineSpeed);
    Quad q9se = Step(q.SE, nineSpeed);
    Quad q4nw = Make(q9nw, q9n, q9c, q9w);
    Quad q4ne = Make(q9n, q9ne, q9e, q9c);
    Quad q4se = Make(q9c, q9e, q9se, q9s);
    Quad q4sw = Make(q9w, q9c, q9s, q9sw);

    // If we are asked to step forwards at speed (level - 2), 
    // then we know that the four quads we just made are stepped 
    // forwards at (level - 3). If we step each of those forwards at 
    // (level - 3) also, then we have the center stepped forward at 
    // (level - 2), as desired.
    //
    // If we are asked to step forwards at less than speed (level - 2)
    // then we know the four quads we just made are already stepped
    // that amount, so just take their centers.

    if (speed == q.Level - 2)
    {  
      Quad rnw = Step(q4nw, speed - 1);
      Quad rne = Step(q4ne, speed - 1);
      Quad rse = Step(q4se, speed - 1);
      Quad rsw = Step(q4sw, speed - 1);
      r = Make(rnw, rne, rse, rsw);
    }
    else
    {
      Quad rnw = q4nw.Center;
      Quad rne = q4ne.Center;
      Quad rse = q4se.Center;
      Quad rsw = q4sw.Center;
      r = Make(rnw, rne, rse, rsw);
    }
  }
  Debug.Assert(q.Level == r.Level + 1);
  return r;
}

As I’m sure you’ve guessed, yes, we’re going to memoize this too! This power has not come for free; we are now doing worst case 13 recursions per non-base call, which means that we could be doing worst case 13n-3 base case calls in order to step forwards 2n-2 ticks, and that’s a lot of base case calls. How on earth is this ever going to work?

Again, because (1) we are automatically skipping empty space of every size; if we have an empty 10-quad that we’re trying to step forwards 256 ticks, we immediately return an empty 9-quad, and (2) thanks to memoization every time we encounter a problem we’ve encountered before, we just hand back the solution. The nature of Life is that you frequently encounter portions of boards that you’ve seen before because most of a board is stable most of the time. We hope.

That’s the core of Gosper’s algorithm, finally. (Sorry it took 35 episodes to get there, but it was a fun journey!) Let’s now integrate that into our existing infrastructure; I’ll omit the memoization and cache management because it’s pretty much the same as we’ve seen already.

The first thing to note is that we can finally get rid of this little loop:

public void Step(int speed)
{
  for (int i = 0; i < 1L << speed; i += 1)
    Step();
}

Rather than implementing Step(speed) in terms of Step(), we’ll go the other way:

public void Step()
{
  Step(0);
}

public void Step(int speed)
{
  // Cache management omitted
  const int MaxSpeed = MaxLevel - 2;
  Debug.Assert(speed >= 0);
  Debug.Assert(speed <= MaxSpeed);

The embiggening logic needs to be a little more aggressive. This implementation is probably more aggressive than we need it to be, but remember, empty space is essentially free both in space and processing time.

  Quad current = cells;
  if (!current.HasAllEmptyEdges)
    current = current.Embiggen().Embiggen();
  else if (!current.Center.HasAllEmptyEdges)
    current = current.Embiggen();
  while (current.Level < speed + 2)
    current = current.Embiggen();

  Quad next = Step(current, speed);
  cells = next.Embiggen();
  generation += 1L << speed;
  // Cache reset logic omitted
}

Now how are we going to perf test this thing? We already know that calculating 5000 individual generations of “acorn” with Gosper’s algorithm will be as slow as the original naïve version. What happens if for our performance test we set up acorn and then call Step(13)? That will step it forwards 8196 ticks:

Algorithm           time(ms) size  Mcells/s 
Naïve (Optimized):   4000     8      82     
Abrash (Original)     550     8     596     
Stafford              180     8    1820     
QuickLife              65    20      ?      
Gosper, sp 0 * 5000  3700    60      ?
Gosper, sp 13 * 1     820    60      ?

Better, but still not as good as any of our improvements over the naïve algorithm, and 13x slower than QuickLife.

So this is all very interesting, but what’s the big deal?

Do you remember the asymptotic time performance of Hensel’s QuickLife? It was O(changes); that is, the cost of computing one tick forwards is proportional to the number of changed cells on that tick. Moreover, period-two oscillators were essentially seen as not changing, which is a huge win.

We know that the long-term behaviour of acorn is that shortly after 5000 ticks in, we have only a handful of gliders going off to infinity and all the rest of the living cells are either still Lifes or period-two oscillators that from QuickLife’s perspective, might as well be still Lifes. So in the long run, the only changes that QuickLife has to process are the few dozens of cells changed for each glider; everything else gets moved into the “stable” bucket.

Since in the long run QuickLife is processing the same number of changes per tick, we would expect that the total time taken to run n ticks of acorn with QuickLife should grow linearly. Let’s actually try it out to make sure. I’m going to run one ticks of acorn with QuickLife, then reset, then run two ticks , then reset, then run four ticks, reset, eight ticks, and so on, measuring the time for each, up to 221 =~ 2.1 million ticks.

Here is a graph of the results; milliseconds on the y axis, ticks on the x axis, log-log scale. Lower is faster.


Obviously the leftmost portion of the graph is wrong; anything less than 256 ticks takes less than 1 millisecond but I haven’t instrumented my implementation to measure sub-millisecond timings because I don’t care about those. I’ve just marked all of them as taking one millisecond.

Once we’re over a millisecond, you can see that QuickLife’s time to compute some number of ticks grows linearly; it’s about 8 microseconds per tick, which is pretty good. You can also see that the line changes slope slightly once we get to the point where it is only the gliders on the active list; the slope gets shallower, indicating that we’re taking less time for each tick.

Now let’s do the same with Gosper’s algorithm; of course we will make sure to reset the caches between every run! Otherwise we would be unfairly crediting speed improvements in later runs to cached work that was done in earlier runs.

Hensel’s QuickLife in blue, Gosper’s HashLife in orange:

Holy goodness! 

The left hand side of the graph shows that Gosper’s algorithm is consistently around 16x slower than QuickLife in the “chaos” part of acorn’s evolution, right up to the point where we end up in the “steady state” of just still Lifes, period-two oscillators and gliders. The right hand side of the graph shows that once we are past that point, Gosper’s algorithm becomes O(1), not O(changes).

In fact this trend continues. We can compute a million, a billion, a trillion, a quadrillion ticks of acorn in around 800ms. And we can embiggen the board to accurately track the positions of those gliders even when they are a quadrillion cells away from the center.

What is the takeaway here? The whole point of this series is: you can take advantage of characteristics of your problem space to drive performance improvements. But what we’ve just dramatically seen here is that this maxim is not sufficient. You’ve also got to think about specific problems that you are solving.

Let’s compare and contrast. Hensel’s QuickLife algorithm excels when:

  • All cells of interest fit into a 20-quad
  • There is a relatively small number of living cells (because memory burden grows as O(living)
  • You are making a small number of steps at a time
  • Living cells are mostly still Lifes or period-two oscillators; the number of “active” Quad4s is relatively small

Gosper’s HashLife algorithm excels when:

  • Boards must be of unlimited size
  • Regularity in space — whether empty space or not — allows large regions to be deduplicated
  • You are making a large number of steps at a time
  • Regularity in time allows for big wins by caching and re-using steps we’ve seen already.
  • You’ve got a lot of memory! Because the caches are going to get big no matter what you do.

That’s why Gosper’s algorithm is so slow if you run in on the first few thousand generations of acorn; that evolution is very chaotic and so there are a lot of novel computations to do and comparatively less re-use. Once we’re past the chaotic period, things become very regular in both time and space, and we transition to a constant-time performance.


That is the last algorithm I’m going to present but I have one more thing to discuss in this series.

Next time on FAIC: we will finally answer the question I have been teasing all this time: are there patterns that grow quadratically? And how might our two best algorithms handle such scenarios?

 

Life, part 34

All right, we have our quad data structure, we know how to get and set individual elements, and we know how to display it. We’ve deduplicated it using memoization. How do we step it forward one tick? (Code for this episode is here.)

Remember a few episodes ago when we were discussing QuickLife and noted that if you have a 2-quad in hand, like these green ones, you can get the state of the blue 1-quad one tick ahead? And in fact we effectively memoized that solution by simply precomputing all 65536 cases.

The QuickLife algorithm memoized only the 2-quad-to-center-1-quad step algorithm; we’re going to do the same thing but with even more memoization. We have a recursively defined quad data structure, so it makes sense that the step algorithm will be recursive. We will use 2-quad-to-1-quad as our base case.

For the last time in this series, let’s write the Life rule:

private static Quad Rule(Quad q, int count)
{
  if (count == 2) return q;
  if (count == 3) return Alive;
  return Dead;
}

We’ll get all sixteen cells in the 2-quad as numbers:

private static Quad StepBaseCase(Quad q)
{
  Debug.Assert(q.Level == 2);
  int b00 = (q.NW.NW == Dead) ? 0 : 1;
  ... 15 more omitted ...

and count the neighbours of the center 1-quad:

  int n11 = b00 + b01 + b02 + b10 + b12 + b20 + b21 + b22;
  int n12 = b01 + b02 + b03 + b11 + b13 + b21 + b22 + b23;
  int n21 = b11 + b12 + b13 + b21 + b23 + b31 + b32 + b33;
  int n22 = b10 + b11 + b12 + b20 + b22 + b30 + b31 + b32;
  return Make(
    Rule(q.NW.SE, n11),
    Rule(q.NE.SW, n12),
    Rule(q.SE.NW, n21),
    Rule(q.SW.NE, n22));
}

We’ve seen this half a dozen times before. The interesting bit comes in the recursive step. The key insight is: for any n>=2, if you have an n-quad in hand, you can compute the (n-1) quad in the center, one tick ahead.

How? We’re going to use almost the same technique that we used in QuickLife. Remember in QuickLife the key was observing that if we had nine Quad2s in the old generation, we could compute a Quad3 in the new generation with sixteen steps on component Quad2s. The trick here is almost the same. Let’s draw some diagrams.

Suppose we have the 3-quad from the image above. We compute the next generation of its four component 2-quads; the green quads are current, the blue are stepped one ahead.

We can use a similar trick as we used with QuickLife to get the north, south, east, west and center 2-quads of this 3-quad, and move each of them ahead one step to get five more 1-quads. I’ll draw the original 3-quad in light green, and we can extract component 2-quads from it that I’ll draw in dark green. We then move each of those one step ahead to get the blue 1-quads.

That gives us this information:

We then make four 2-quads from those nine: and extract the center 1-quad from each using the Center function (source code below). I’ll just show the northwest corner; you’ll see how this goes. We make the light blue 2-quad out of four of the blue 1-quads, and then the center 1-quad of that thing is:

We do that four times and from those 1-quads we construct the center 2-quad moved one step ahead:

Summing up the story so far:

  • We can take a 2-quad forward one tick to make a 1-quad with our base case.
  • We’ve just seen here that we can use that fact to take a 3-quad forward one tick to make a 2-quad stepped forward one tick.
  • But nothing we did in the previous set of steps depended on having a 3-quad specifically. Assume that for some n >= 2 we can move an n-quad forward one tick to make an (n-1) quad; we have above an algorithm where we use that assumption and can move an (n+1)-quad forward to get an n-quad.

That is, we can move a 2-quad forward with our base case; moving a 3-quad forward requires the ability to move a 2-quad forward. Moving a 4-quad forward requires the ability to move a 3-quad forward, and so on.

As I’ve said many times on this blog, every recursive algorithm is basically the same. If we’re in the base case, solve the problem directly. If we’re not in the base case, break up the problem into finitely many smaller problems, solve each, and use the solutions to solve the larger problem.

Let’s write the code to move any n-quad for n >= 2 forward one tick.

We’ll need some helper methods that extract the five needed sub-quads, but those are easily added to Quad. (Of course these helpers are only valid when called on a 2-quad or larger.)

public Quad Center => Make(NW.SE, NE.SW, SE.NW, SW.NE);
public Quad N => Make(NW.NE, NE.NW, NE.SW, NW.SE);
public Quad E => Make(NE.SW, NE.SE, SE.NE, SE.NW);
public Quad S => Make(SW.NE, SE.NW, SE.SW, SW.SE);
public Quad W => Make(NW.SW, NW.SE, SW.NE, SW.NW);

And then I’ll make a static method that takes a Quad and returns the center stepped forward one tick. (Why not an instance method on Quad? We will see in a moment.)

private static Quad Step(Quad q)
{
  Debug.Assert(q.Level >= 2);
  Quad r;
  if (q.IsEmpty)
    r = Empty(q.Level - 1);
  else if (q.Level == 2)
    r = StepBaseCase(q);
  else
  {
    Quad q9nw = Step(q.NW);
    Quad q9n = Step(q.N);
    Quad q9ne = Step(q.NE);
    Quad q9w = Step(q.W);
    Quad q9c = Step(q.Center);
    Quad q9e = Step(q.E);
    Quad q9sw = Step(q.SW);
    Quad q9s = Step(q.S);
    Quad q9se = Step(q.SE);
    Quad q4nw = Make(q9nw, q9n, q9c, q9w);
    Quad q4ne = Make(q9n, q9ne, q9e, q9c);
    Quad q4se = Make(q9c, q9e, q9se, q9s);
    Quad q4sw = Make(q9w, q9c, q9s, q9sw);
    Quad rnw = q4nw.Center;
    Quad rne = q4ne.Center;
    Quad rse = q4se.Center;
    Quad rsw = q4sw.Center;
    r = Make(rnw, rne, rse, rsw);
  }
  Debug.Assert(q.Level == r.Level + 1);
  return r;
}

Well that was easy! We just do nine recursions and then reorganize the resulting nine one-tick-forward quads until we get the information we want, and return it. (I added a little easy out for the empty case, though strictly speaking that is not necessary.)

There are probably three things on your mind right now.

  • If we get a full quad-size smaller every time we step, we’re going to get down to a very small board very quickly!
  • QuickLife memoized the step-the-center-of-a-2-quad operation. Why aren’t we doing the same thing here?
  • Nine recursions is a lot; isn’t this going to blow up performance? Suppose we have an 8-quad; we do nine recursions on 7-quads, but each of those does nine recursions on 6-quads, and so on down to 3-quads. It looks like we are doing 9n-2 calls to the base case when stepping an n-quad forward one tick.

First things first.

When do we not care if we’re shrinking an n-quad down to an (n-1)-quad on step? When all living cells in the n-quad are already in the center (n-1)-quad. But that condition is easy to achieve! Let’s actually write our public step method, not just the helper that steps a quad. And heck, let’s make sure that we have more than enough empty space. Remember, empty space is super cheap. 

sealed class Gosper : ILife, IDrawScale, IReport
{
  private Quad cells;
  private long generation;
  ...
  public void Step()
  {
    Quad current = cells;

Make cells bigger until there are two “levels” of empty cells surrounding the center. (We showed Embiggen last time.) That way we are definitely not throwing away any living cells when we get a next state that is one size smaller:

    if (!current.HasAllEmptyEdges)
      current = current.Embiggen().Embiggen();
    else if (!current.Center.HasAllEmptyEdges)
      current = current.Embiggen();
    Quad next = Step(current);

We’ve stepped, so next is one size smaller than current. Might as well make it bigger too; that’s one fewer thing to deal with next time. Again, empty space is cheap.

  cells = next.Embiggen();
  generation += 1;
}

HasAllEmptyEdges is an easy helper method of Quad:

public bool HasAllEmptyEdges => 
  NW.NW.IsEmpty &&
  NW.NE.IsEmpty &&
  NE.NW.IsEmpty &&
  NE.NE.IsEmpty &&
  NE.SE.IsEmpty &&
  SE.NE.IsEmpty &&
  SE.SE.IsEmpty &&
  SE.SW.IsEmpty &&
  SW.SE.IsEmpty &&
  SW.SW.IsEmpty &&
  SW.NW.IsEmpty &&
  NW.SW.IsEmpty;

That was an easy problem to knock down. Second problem: QuickLife memoized the 2-quad-to-1-quad step algorithm and got a big win; shouldn’t we do the same thing?

Sure, we have a memoizer, we can do so easily. But… what about our third problem? We have a recursive step that is creating exponentially more work as the quad gets larger as it traverses our deduplicated tree structure.

Hmm.

It is recursing on a deduplicated structure, which means it is probably going to encounter the same problems several times. If we move a 3-quad forward one step to get a 2-quad, we’re going to get the same answer the second time we do the same operation on the same 3-quad. If we move a 4-quad forward one step to get a 3-quad, we will get the same answer the second time we do it. And so on. Let’s just memoize everything.

We’ll rename Step to UnmemoizedStep, create a third memoizer, and replace Step with:

private static Quad Step(Quad q) => 
  CacheManager.StepMemoizer.MemoizedFunc(q);

And now we have solved our second and third problems with one stroke.

Let’s run it! We’ll do our standard performance test of 5000 generations of acorn:

Algorithm           time(ms) size  Mcells/s 
Naïve (Optimized):   4000     8      82     
Abrash (Original)     550     8     596     
Stafford              180     8    1820     
QuickLife              65    20      ?      
Gosper v1            3700    60      ?

Oof.

It’s slow! Not as slow as the original naïve implementation, but just about.

Hmm.

That’s the time performance; what’s the memory performance? There’s a saying I’ve used many times; I first heard it from Raymond Chen but I don’t know if he coined it or was quoting someone else. “A cache without an expiration policy is called a memory leak”. Memory leaks can cause speed problems as well as memory problems because they increase burden on the garbage collector, which can slow down the whole system. Also, dictionaries are in theory O(1) access — that is, access time is the same no matter how big the cache gets — but theory and practice are often different as the dictionaries get very large.

How much memory are we using in this thing? The “empty” memoizer never has more than 61 entries, so we can ignore it. I did some instrumentation of the “make” and “step” caches; after 5000 generations of acorn:

  • the step and make caches both have millions of entries
  • half the entries were never read at all, only written
  • 97% of the entries were read fewer than twenty times
  • the top twenty most-read entries account for 40% of the total reads

This validates our initial assumption that there is a huge amount of regularity; the “unusual” situations recur a couple dozen times tops, and we spend most of our time looking at the same configurations over and over again.

This all suggests that we could benefit from an expiration policy. There are two things to consider: what to throw away, and when to throw it away. First things first:

  • An LRU cache seems plausible; that is, when you think it is time to take stuff out of the cache, take out some fraction of the stuff that has been Least Recently Used. However LRU caches involve making extra data structures to keep track of when something has been used; we do extra work on each step, and it seems like that might have a performance impact given how often these caches are hit.
  • The easiest policy is: throw it all away! Those 20 entries that make up 40% of the hits will be very quickly added back to the cache.

Let’s try the latter because it’s simple. Now, we cannot just throw it all away because we must maintain the invariant that Make agrees with Empty; that is, when we call Make with four empty n-quads and when we call Empty(n+1) we must get the same object out. But if we throw away the “make” cache and then re-seed it with the contents of the “empty” cache — which is only 61 entries, that’s easy — then we maintain that invariant.

When to throw it away? What we definitely do not want to happen is end up in a situation where we are throwing away stuff too often. We can make a very simple dynamically-tuned cache with this policy:

  • Set an initial cache size bound. 100K entries, 1M entries, whatever.
  • Every thousand generations, check to see if we’ve exceeded the cache size bound. If not, we’re done.
  • We’ve exceeded the bound. Throw away the caches, do a single step, and re-examine the cache size; this tells us the cache burden of doing one tick.
  • The new cache size bound is the larger of the current bound and twice the one-tick burden. That way if necessary the size bound gradually gets larger so we do less frequent cache resetting.

The code is straightforward; at the start of Step:

bool resetMaxCache = false;
if ((generation & 0x3ff) == 0)
{
  int cacheSize = CacheManager.MakeQuadMemoizer.Count + 
    CacheManager.StepMemoizer.Count;
  if (cacheSize > maxCache)
  {
    resetMaxCache = true;
    ResetCaches();
  }
}

“ResetCaches” throws away the step cache and resets the make cache to agree with the empty cache; I won’t bother to show it. At the end of Step:

if (resetMaxCache)
{
  int cacheSize = CacheManager.MakeQuadMemoizer.Count + 
    CacheManager.StepMemoizer.Count;
  maxCache = Max(maxCache, cacheSize * 2);
}

All right, let’s run it again!

Algorithm           time(ms) size  Mcells/s 
Naïve (Optimized):   4000     8      82     
Abrash (Original)     550     8     596     
Stafford              180     8    1820     
QuickLife              65    20      ?      
Gosper v2            4100    60      ?

It’s worse. Heck, it is worse than the naive algorithm!

Sure, the top twenty cache entries account for 40% of the hits, and sure, 97% of the entries are hit fewer than twenty times. But the statistic that is relevant here that I omitted is: the top many hundreds of cache entries account for 50% of the hits. We don’t have to rebuild just the top twenty items in the cache to start getting a win from caching again. We take a small but relevant penalty every time we rebuild the caches.

Sigh.

We could keep on trying to improve the marginal performance by improving our mechanisms. We could try an LRU cache, or optimize the caches for reading those top twenty entries, or whatever. We might eke out some wins. But maybe instead we should take a step back and ask if there’s an algorithmic optimization that we missed.


Next time on FAIC: There is an algorithmic optimization that we missed. Can you spot it?

Installing windows

Episode 34 will be delayed again — sorry! — because once again the time I had set aside for writing this weekend got consumed by a real-world task that could not wait. (I will try for Thursday of this week.)

Some friends who are moving had a handyman failure; as is often the case when renovating a house to be sold, they have a set of build dependencies that required this window to be replaced in a hurry in order to not slip the schedule for other renovations, so I volunteered to take care of it.

Yuck.

Living in a 112 year old house myself, I am used to doing archaeological investigations of the strange decisions made by previous owners. This window, though obviously an old single-paned window, did not look like it was original to the 120-year-old house. It was the wrong size for the rough opening; the hinges looked more modern than turn-of-the-century, and so on.

Sure enough, when disassembled there was a gap behind the trim that was insulated with crumpled newspapers from 1957. Oddly enough they were Pittsburgh newspapers from different days; perhaps the owners of the house in 1957 moved from Pittsburgh, replaced a window, and insulated the gaps with the packing paper they moved with? It’s a mystery.

Having zero haircuts since quarantine began has done wonders for my hair.

New window in and trimmed — obviously the paint will need to be redone but that’s why the window had to go in before the painters arrived.

And the interior needs a little more drywalling and priming before it is ready for painting, but it is 1000000x better than before at least.

The neighbours in the blue house apparently asked my friends for my contact information as they also have a window that needs replacing. I am quite chuffed. I had my friends pass along that I only do windows as a favour, but I would be happy to design them a programming language for hire should they need one of those.

Next time: Gosper’s algorithm, finally!

 

Implementing a full fence

Episode 34 of my ongoing series will be slightly delayed because I spent the time on the weekend I normally spend writing instead rebuilding one of my backyard fences.

I forgot to take a before picture, but believe me, it was ruinous when I bought the place in 1997 and to the point in 2020 where it was actually falling to pieces in a stiff breeze.

I replaced it with an identical design and materials:

…and divided and re-homed some sixteen dozen irises in the process.

It’s nice implementing something that requires no typing every now and then.

 

Life, part 33

Last time in this series we learned about the fundamental (and only!) data structure in Gosper’s algorithm: a complete quadtree, where every leaf is either alive or dead and every sub-tree is deduplicated by memoizing the static factory.

Suppose we want to make this 2-quad, from episode 32:

This is easy enough to construct recursively:

Quad q = Make(
  Make(Dead, Alive, Alive, Dead), // NW
  Empty(1), // NE
  Make(Dead, Dead, Alive, Alive), // SE
  Make(Dead, Alive, Alive, Dead)) // SW

Since Make is memoized, the NW and SW corners will be reference-equal.

But suppose we had this quad in hand and we wanted to change one of the cells from dead to alive — how would we do that? Since the data structure is immutable, a change produces a new object; since it is persistent, we re-use as much of the old state as possible. But because we re-use quads arbitrarily often, a quad cannot know its own location. 

The solution is: every time we need to refer to a specific location within a quad, we must also say what the coordinates are in the larger world of the lower left corner cell of that quad.

Let’s start with some easy stuff. Suppose we have a quad (this), the coordinate of its lower-left corner, and a point of interest (p). We wish to know: is p inside this quad or not?

(Source code for this episode is here.)

private bool Contains(LifePoint lowerLeft, LifePoint p) => 
  lowerLeft.X <= p.X && p.X < lowerLeft.X + Width &&
  lowerLeft.Y <= p.Y && p.Y < lowerLeft.Y + Width;

Easy peasy. This enables us to implement a getter; again, this is a method of Quad:

// Returns the 0-quad at point p
private Quad Get(LifePoint lowerLeft, LifePoint p)
{

If the point is outside the quad, assume it is dead.

  if (!Contains(lowerLeft, p))
    return Dead;

The point is inside the quad. Did we find the 0-quad we’re after? Return it!

  if (Level == 0)
    return this;

We did not find it, but it is around here somewhere. It must be in one of the four children, so figure out which one and recurse. Remember, we’ll need to compute the lower left corner of the quad we’re recursing on.

  long w = Width / 2;
  if (p.X >= lowerLeft.X + w)
  {
    if (p.Y >= lowerLeft.Y + w)
      return NE.Get(new LifePoint(lowerLeft.X + w, lowerLeft.Y + w), p);
    else
      return SE.Get(new LifePoint(lowerLeft.X + w, lowerLeft.Y), p);
  } 
  else if (p.Y >= lowerLeft.Y + w)
    return NW.Get(new LifePoint(lowerLeft.X, lowerLeft.Y + w), p);
  else
    return SW.Get(lowerLeft, p);
  }
}

Once we’ve seen the getter, it’s easier to understand the setter. The setter returns a new Quad:

private Quad Set(LifePoint lowerLeft, LifePoint p, Quad q)
{
  Debug.Assert(q.Level == 0);

If the point is not inside the quad, the result is no change.

  if (!Contains(lowerLeft, p))
    return this;

The point is inside the quad. If we are changing the value of a 0-quad, the result is the 0-quad we have in hand.

if (Level == 0)
  return q;

We need to recurse; you might think that the setter logic is going to be a mess right now, but actually it is very straightforward. Since setting a point that is not in range is an immediate identity, we can just recurse four times!

long w = Width / 2;
return Make(
  NW.Set(new LifePoint(lowerLeft.X, lowerLeft.Y + w), p, q),
  NE.Set(new LifePoint(lowerLeft.X + w, lowerLeft.Y + w), p, q),
  SE.Set(new LifePoint(lowerLeft.X + w, lowerLeft.Y), p, q),
  SW.Set(lowerLeft, p, q)) ;
}

That’s the internal logic for getting and setting a 0-quad inside a larger quad. I’ll also add public get and set methods that are just wrappers around these; the details are not particularly interesting.


What about drawing the screen? We haven’t talked about it for a while, but remember the interface that we came up with way back in the early episodes: the UI calls the Life engine with the screen rectangle in Life grid coordinates, and a callback that is called on every live cell in that rectangle:

 void Draw(LifeRect rect, Action<LifePoint> setPixel);

The details of “zooming in” to draw the pixels as squares instead of individual pixels was entirely handled by the UI, not by the Life engine, which should make sense.

We can quickly determine whether a given quad is empty, and therefore we can optimize the drawing engine to skip over all empty quads without considering their contents further, and that’s great. But there is another possibility here: because we can determine quickly if a quad is not empty, we could implement “zooming out”, so that every on-screen pixel represented a 1-quad or a 2-quad or whatever; we can zoom out arbitrarily far.

Let’s implement it! I’m going to add a new method to Quad that takes four things:

  • What is the coordinate address of the lower left corner of this quad?
  • What rectangle, in Life coordinates, are we drawing?
  • The callback
  • What is the level of the smallest quad we’re going to draw? This is the zoom factor.
private void Draw(
  LifePoint lowerLeft, 
  LifeRect rect, 
  Action<LifePoint> setPixel, 
  int scale)
{
  Debug.Assert(scale >= 0);

If this quad is empty then that’s easy; there’s nothing to draw.

  if (IsEmpty)
    return;

The quad is not empty. If this quad does not overlap the rectangle, there’s nothing to draw. Unfortunately I chose to make the rectangle be “top left corner, width, height” but we are given the bottom left corner, so I have a small amount of math to do.

  if (!rect.Overlaps(new LifeRect(
      lowerLeft.X, lowerLeft.Y + Width - 1, Width, Width)))
    return;

(“Do these rectangles overlap?” is of course famously an interview problem; see the source code for my implementation.)

If we made it here we have a non-empty quad that overlaps the rectangle. Is this the lowest level we’re going to look at? Then set that pixel.

  if (Level <= scale)
  {
    setPixel(lowerLeft);
    return;
  }

We are not at the lowest level yet. Simply recurse on the four children; if any of them are empty or not overlapping, they’ll return immediately without doing more work.

  long w = Width / 2;
  NW.Draw(new LifePoint(lowerLeft.X, lowerLeft.Y + w), rect, setPixel, scale);
  NE.Draw(new LifePoint(lowerLeft.X + w, lowerLeft.Y + w), rect, setPixel, scale);
  SE.Draw(new LifePoint(lowerLeft.X + w, lowerLeft.Y), rect, setPixel, scale);
  SW.Draw(lowerLeft, rect, setPixel, scale);
}

That’s the internal logic; there is then a bunch of public wrapper methods around it that I will omit, and a new interface IDrawScale. We now have enough gear that we can start implementing our actual engine:

sealed class Gosper : ILife, IReport, IDrawScale
{
  private Quad cells;
  public Gosper()
  {
    Clear();
  }
  public void Clear()
  {
    cells = Empty(9);
  }
  public bool this[LifePoint p]
  {
    get => cells.Get(p) != Dead;

That’s all straightforward. But what about the setter? We’ve made a 9-quad, but what if we try to set a point outside of it? Fortunately it is very cheap to expand a 9-quad into a 10-quad, and then into an 11-quad, and so on, as we’ve seen.

    set
    {
      while (!cells.Contains(p))
        cells = cells.Embiggen();
      cells = cells.Set(p, value ? Alive : Dead);
    }
  }

The implementation of Embiggen is just what you would expect:

public Quad Embiggen()
{
  Debug.Assert(Level >= 1);
  if (Level >= MaxLevel)
    return this;
  Quad q = Empty(this.Level - 1);
  return Make(
    Make(q, q, NW, q), Make(q, q, q, NE),
    Make(SE, q, q, q), Make(q, SW, q, q));
}

And the rest of the methods of our Gosper class are uninteresting one-liners that just call the methods we’ve already implemented.


I’ve updated the UI to support “zoom out” functionality and created a Life engine that does not step, but just displays a Quad. Previously we could display a zoomed-in board:

or one pixel per cell:

But now we can display one pixel per 1-quad:

or one pixel per 2-quad, or whatever; we can zoom out arbitrarily far. If any cell in the quad is on, the pixel is on, which seems reasonable. This will let us much more easily visualize patterns that are larger than the screen.


Next time on FAIC: How do we step a quad forward one tick?

Approximate results may vary

Part 33 of my ongoing series is coming but I did not get all the code written that I wanted to this week, so it will be delayed. In the meanwhile:


Living in Canada as a child, of course I grew up learning the metric system (with some familiarity with the imperial and US systems of course). If you want to know how many milliliters of liquid a cubic box holds, you just compute the volume of the box in cubic centimeters and you’re done, because a milliliter is by definition the same volume as a cubic centimeter.

Despite having lived in Seattle for over 25 years, I still sometimes do not have an intuitive sense of conversions of US units; for a particular project I knew that I needed an amount of liquid equal to about the volume of a two-inch cube, but the bottle was measured in fluid ounces. Fortunately we have this operation in every browser:

Thanks Bing for those important eight digits after the decimal place there, manufactured from the zero digits after the decimal place of the input. For context, that extra 0043 on the end represents a volume equivalent to roughly the size of a dozen specs of microscopic dust.

But the punchline that made me LOL was “for an approximate result, divide the volume by 1.8046875”, because of course when I am quickly approximating the volume of eight cubic inches in fluid ounces, the natural operation that immediately comes to mind is divide by 1.8046875.

I have some questions.

That was Bing; what does Google do with the same query?

OK, that’s an improvement in that the amount of precision is still unnecessary, but not outright absurd. But the other problems are all the same.

Why division? Maybe it is just me, but I find division by uneven quantities significantly harder mental arithmetic than multiplication; assuming we want the over-precision, would it not be better to say “for an approximate result, multiply the volume by 0.5541125″? 

Why say “to approximate…” and then give an absurd amount of precision in the conversion factor? Approximation is by definition about deliberately not computing an over-precise value.

Surely the right way to say this is “for an approximate result, multiply the volume by 5/9” or even better, “divide by two“. When I saw “divide by 1.8046875” the first thing I thought after “wow that’s so over-precise” was “1.8 is 18/10 is 9/5, so multiply by 5/9“.

I’m going to get there eventually; software can shorten that journey. And I’m going to remember that 5/9ths of a cubic inch is a fluid ounce much more easily than I remember to divide by 1.8046875.

Once you start to see this pattern of over-precision in conversions, it’s like the FedEx arrow: you can’t stop seeing it. Let’s ask the browser how much does an American robin weigh?


64.8 grams. An underweight robin is not 64.5 grams, and not 65.0 grams but 64.8 grams.

To be fair, it looks like this over-precision was the fault of a human author (and their editor) not thinking clearly about how to communicate the fact at hand, rather than bad software this time; if you’re converting “two and a third ounces” to grams it would be perfectly reasonable to round to 65, or even 60. (That third of an ounce is already suspect; surely “two to three ounces” is just fine.) Most odd though is that the computations are not even correct! 2 and 1/3rd ounces is 66.15 grams, and 3 ounces is 85.05 grams, making it rather mysterious where the extra few hundred milligrams went.

I was wondering how many earthworms a robin would have to eat to make up a discrepancy of 0.2 grams. A largish earthworm has got to weigh on the order of a gram, right?

Wow! (For my metric readers out there: 1.5 pounds is 680.388555 grams according to Bing.)

Again: what the heck, Bing? I did not ask for “world’s largest earthworm” or “unusually large earthworms” or even “Australian earthworms”. You know where I live, Bing. (And Google search does no better.)

For some reason I am reminded of Janelle Shane’s “AI Weirdness” tweets; the ones about animal facts are particularly entertaining. These earthworm facts at least have the benefit of being both interesting and correct, but this is hardly the useful result about normal garden-variety annelids that I wanted.

I am also reminded of my favourite animal fact: the hippopotamus can jump higher than a house. It sounds impressive until you remember that houses can’t jump at all.

Obviously all these issues are silly and unimportant, which is why I chose them for my fun-for-Friday blog. And the fact that I can type “8 cubic inches in oz” into my browser or say “how much does a robin weigh?” into a smart speaker and instantly get the answer is already a user interface triumph; I don’t want to minimize that great work. But there is still work to do! Unwarranted extra precision is certainly not the worst sort of fallacious reasoning we see on the internet, but it is one of the most easily mitigated by human-focused software design. I’d love to see improvements to these search functions that show even more attention to what the human user really needs.