Code for this episode can be found here, which — unusually — is not my GitHub repo.

Last time on FAIC I mentioned that there were two basic techniques for improving raw performance:

- keep the algorithm the same but find a marginal improvement that makes an iteration of the algorithm run faster (or use less memory, or whatever)
- find some characteristic of the problem that allows us to use a better algorithm entirely

My hope was that we’d be past the first one by now, but a number of readers and commenters have brought up points that ought to be addressed, so I’m going to spend two more episodes on the “marginal improvement” road. Specifically, on the “throw specialized hardware at the problem” technique.

I said during our discussion of Scholes’ algorithm that, though it appears to be doing exactly the same computations as the standard implementation of the naive algorithm, it is conceptually different because *logically* it treats the Life problem as a series of operations on matrices; that is, *values*. My implementation of course had a bunch of O(n) loops in it, but those loops were implementation details of the operations like “shift this matrix to the left” or “add these eight matrices together”.

What if we could *actually* perform matrix operations in parallel across all matrix elements at the same time? That is, *truly* apply the operations we’re laboriously doing byte-by-byte in each array all at once, as though they were values?

After all, we do this for the bit arrays that we call integers! When we add two 32 bit integers together we do not think of the fact that the underlying hardware is doing operations on each bit. Under what circumstances could we do the same thing to Life grids?

I’m going to discuss two techniques for this. Today, SIMD, and next time, some ideas from image processing.

It is extremely common in modern computing to wish to perform the same operation to multiple integers. Let’s say, for example, we have 44000 short (16 bit) integers representing the waveform of one second of CD audio. We’re performing some transformation on all that data in real time, as the bits are coming off the disc. Let’s say we’re reducing the volume by half. Well that’s easy enough! It’s just a one-bit unsigned shift to the right.

What would be really great though is if we could copy, say, sixteen of those short integers out of a buffer and into a 256 bit register, shift all of them to the right *in parallel*, and then put them back in the buffer. Plainly that would be much faster than pulling 16 bits out of the buffer, shifting it, putting it back, repeat, *provided that we had hardware that could do the shifting in parallel on 16 shorts stuck into a 256 bit register.*

But we do have such hardware; a great many chips now support SIMD instructions. That is* Single Instruction on Multiple Data*. Moreover, .NET Core provides methods that allow you to insert calls to SIMD instructions into your C# programs; if the hardware supports it, then the jitter will turn those calls into these instructions that act in parallel.

How could we use that for Life? Suppose we restricted ourselves to a 4-quad. That is, a 16 x 16 grid in my nomenclature. That’s pretty small, but we could fit a 16×16 array of cells into an array of 16 shorts if we did one bit per cell, and then we could implement Scholes’s algorithm by doing parallel bit shifts for the left and right rotations, and pointer offsets for the up and down rotations.

As it turns out, we do not have to actually suppose anything at all; we can just examine the implementation sent to me by Jeroen Frijters which does exactly that. I’m not going to go through it line by line, but I’ll hit the highlights. We can divide the program into three parts: initialization, the core stepping algorithm, and the adder. (And there is also some debug output, which I will ignore.)

Initialization is straightforward; we have a 16×16 grid and we load it into a 256 bit buffer one ushort at a time:

var v = new Vector256<ushort>() v = v.WithElement(0x0, (ushort)0b0100_0000_0000_0000); v = v.WithElement(0x1, (ushort)0b0010_0000_0000_0111); ...

That buffer is then dumped into an 18-ushort fixed-width array.

fixed ushort state[18]; internal Life16x16(Vector256<ushort> initialState) { fixed (ushort* p = state) { Avx.Store(p + 1, initialState); } }

You might not be familiar with fixed-width arrays in C#, as they are typically only ever used for unmanaged code interop scenarios and the like. The basic idea is that we allocate enough memory to hold exactly 18 ushorts, and then tell the garbage collector *keep this thing still for a moment because I’m going to make a pointer into its interior*. (Remember, the GC is allowed to relocate the storage associated with living objects at its discretion.)

Unfortunately this leads to some nomenclature confusion because we then have a **fixed** **width** buffer that is **fixed in place** and boy is it easy to get those meanings mixed up if you are not careful. I really wish the designers of C# 1.0 had used “pinned” or “immovable” or some other term to distinguish the two meanings of “fixed” in this context.

Why 18? We shall see in a moment.

That does it for the initialization phase. The heart of the algorithm is of course the step function, and let me take this opportunity to rewrite it very slightly to make it easier to understand:

fixed (ushort* p = state) { var cells = Avx.LoadDquVector256(p + 1); var s = Avx.LoadDquVector256(p); var n = Avx.LoadDquVector256(p + 2); var e = Avx2.ShiftRightLogical(cells, 1); var se = Avx2.ShiftRightLogical(s, 1); var sw = Avx2.ShiftLeftLogical(s, 1); var w = Avx2.ShiftLeftLogical(cells, 1); var ne = Avx2.ShiftRightLogical(n, 1); var nw = Avx2.ShiftLeftLogical(n, 1);

We copied the 16 ushorts worth of board state into the interior of the 18 byte buffer, so that there would be an all-zero word at the top and bottom. That way the “shift north” and “shift south” operations are nothing more than reading the bits starting from the first and third ushorts in the fixed-size array.

The shifts to the east and west are done in parallel on all 256 bits in the vector. Something to keep in mind here is that this is not a right or left shift of a 256 bit integer! That would be wrong because the edges would wrap incorrectly. Rather, it is truly doing a single shift-by-one operation on each of the 16 ushorts in the bit vector, so we get a zero bit on the left or right appropriately. Recall that when I implemented this a couple episodes ago on a byte array I needed to add code to ensure that zeros were filled in appropriately, but here we get that for free from the hardware.

We then have the remainder of Scholes’ algorithm, though it is slightly more verbose than I wrote:

var acc = new FourBitAccumulator(se, s, sw, e); acc.Add(cells); acc.Add(w); acc.Add(se); acc.Add(n); acc.Add(sw); Avx.Store(p + 1, Avx2.Or(acc.IsThree, Avx2.And(acc.IsFour, cells)));

What’s happening here is we have a helper class that implements the “add these nine bit vectors together to get the living neighbor count”, which of course fits into four bits.

We then do the same thing as before: call helper methods to find all the threes and living fours, and that’s our next generation.

The remainder of the program is a four-bit parallel adder that also uses SIMD instructions to add all nine 256 bit vectors together in parallel, which produces four 256 bit vectors each of which has one bit of the sum. Going through the code to verify that it is a correct adder is tedious and spoiler alert, it is correct, so I’ll skip that.

What kind of performance boost would we get if we used this thing? We can do a quick calculation to get an order of magnitude.

My implementation of Scholes’ algorithm took 3.25 seconds to compute 5000 generations of an 8-quad — that is, a 256×256 cell grid. That works out to 100 megacells/s.

I did a quick performance check of this algorithm and I got **5 million ticks per second**, but only doing the work on 256 cells. That’s 1.3 gigacells/s, 13 times faster on a cells-per-second basis. That seems like the right order of magnitude speedup considering that all the work is being done in parallel.

Now, of course scaling this algorithm up to use SIMD instructions on 8-quads would not be simply doing 256 times the work as was done on a 4-quad because we cannot use the fact that the north, south, east and west shifts put zeros along the edges! We need those bits to be filled in *with the correct bits from the grid state*, and that’s going to be many additional instructions, so maybe this would only be 10x or 8x faster after that. But 8x faster is pretty good.

On the other hand, remember what our scalability problem is here. Every time we double the width of the grid we get 4x as many cells in it. So what this means is that if an n-quad is within our performance budget for my unoptimized algorithm, then an (n+1) or maybe if we are lucky, an (n+2) quad is in our budget for this SIMD version assuming we could scale it up to larger grids. It doesn’t sound like nearly so great an increase when you look at it that way.

Many thanks to Jeroen for inspiring this episode and providing the code.** Next time on FAIC** we’ll finish off our discussion of parallelism with a quick introduction to a basic image processing technique that is germane to our topic.

Timing is interesting. Throwing specialized hardware at a problem is indeed one approach. Almost 2 decades ago I was approached to help with an intended art exhibit. The floor of a large room made up of 3″ square Life Cells (with a twist, pressure sensors could contaminate or kill a cell in additional to the standard rules)….. Mat really makes the timing special, is I found a box full of the PCB’s from that project this past week….

Pingback: The Morning Brew - Chris Alcock » The Morning Brew #2994

This might be my favourite series yet!

What’s also quite interesting is when you measure the efficiency of your algorithm in terms of work done on a per single-threaded CPU clock cycle, e.g 1.3 gigacells/sec on a 3.6ghz cpu is around 0.36 cells per clock cycle. This absolutely does not mean that it only takes 3 cycles to update a single cell, and you could also argue that the SIMD stuff is parallelizing the problem in a similar way to how multi-threading would. But the fact you can get so close to one cell per single-threaded clock cycle even at this stage, IMHO, hints at something fundamental about the problem you’re trying to solve, which alas I’m struggling to put into words. I guess it’s just a very ‘compressible’ problem, as I think your final optimisation will show.

I’m glad you’re enjoying it! I am too. I’ve been meaning to do this for over a decade.

Indeed we will discover as this series goes on that it is incredibly “compressible”; we’ll compress boards in both space *and time*. Stay tuned!

Pingback: The Morning Brew #2994 -