Last time on FAIC we took a look at Scholes’ extremely concise Life algorithm, which treats a grid as an array that you can treat as a mathematical value with some unusual but entirely straightforward manipulations. We didn’t get the same concision in C# as you would in APL, but I’m OK with that.
I won’t go through the same level of detail discussing the asymptotic performance of this algorithm as I did for the naïve algorithm. If we have n cells in the array then we do eight array manipulations, each of which allocates an array of size n and fills it in; each of these operations will be O(n). Similarly, the “Where” and “or” operations are O(n), and so the whole thing is O(n).
This shouldn’t be a surprise; this is in many ways just the naïve algorithm again! We do almost exactly the same work — we compute the sum of the living neighbors of every single cell, and set the new cell state based on that sum and each cell’s current state. We just do the operations in a slightly different order.
What about actual time and memory performance?
Plainly this algorithm as I have implemented it takes more temporary memory; it allocates thirteen arrays as it goes, and those arrays are then garbage that needs to be collected. The optimized version of the naive algorithm allocates only two arrays and it keeps both of them alive, so neither becomes garbage.
The garbage is at least short-lived and so will be collected quickly. But in my example of an 8-quad (256×256) byte array, we get in under the limit for allocation on the Large Object Heap. Things might be different if we moved up to a 9-quad because then all these temporary arrays would be large objects, and the GC assumes that large objects live longer. I haven’t tried it out to see what the impact is.
What about time?
As I said in a previous episode, when I make a performance prediction I am on average dead wrong maybe a third of the time. We saw that on my machine, after some small amount of very straightforward performance tuning the naïve algorithm took about 4 seconds to do 5000 ticks of an 8-quad; my prediction was that since Scholes’ algorithm is doing all the same amount of work, and allocating 13 temporary arrays as it goes, that it would be around the same but slightly slower due to all the copying.
Imagine my astonishment then when I discovered that my implementation of Scholes algorithm without any perf work at all took 3.25 seconds to do the same problem. Nearly 20% faster! I must confess, I do not know what is going on here, but I do know that those array copy steps are extremely fast. Unfortunately I do not have the time right now do to a detailed perf analysis to figure out what is going on here; if anyone has insights, please leave a comment.
Let me finish up this episode with three additional thoughts:
First, I noted last time that the algorithm I implemented is inspired by Scholes’ APL algorithm but is not exactly the same. How is it different?
The big thing is that my “array shift” operations are different than the array “rotations” used in the APL algorithm. That is, my “shift left” would transform an array like this:
1 2 3 2 3 0 4 5 6 --shift left-> 5 6 0 7 8 9 8 9 0
Whereas I believe — any APL aficionados reading this please confirm or deny — that the APL rotation is:
1 2 3 2 3 1 4 5 6 --rotate left-> 5 6 4 7 8 9 8 9 7
And similarly for shifting right, up and down.
I mentioned several episodes back that a standard technique for implementing Life algorithms is to make the edges of a finite grid “wrap around”, effectively making the board a torus. That’s what this algorithm does if you use this array rotation, and if you watch the video I linked in the previous episode you will see that in fact the given code implements wrap-around Life.
Second, I implemented the byte block data type in the least sophisticated manner possible: allocate an array on every operation. There are other ways to store data to make the sorts of operations we’re doing involve fewer array copies, and those could possibly reduce the time and memory consumption further. If you are clever (and the arrays are immutable) then you can instead of making a copy, instead just keep the original array and do a little extra math on every array read.
Though it would be interesting to know how much of an improvement (or regression!) those kinds of optimizations would achieve, I don’t want to spend too much time digging into this one algorithm. If anyone wants to do the experiment, please do and report back.
Third, one of the themes of this series that is emerging is that there are two basic ways to attack a performance problem:
- Keep the algorithm basically the same but find a marginally faster way to implement it. For instance: avoid array bounds checks by using unsafe pointer arithmetic, or throw specialized libraries or hardware at the problem. This does not improve asymptotic performance or scalability but it can lead to small or large wins in raw performance on problems of a specific size.
- Find some characteristic of the specific problem under investigation that enables us to come up with a new algorithm that does less work or uses less space or has less GC burden. This often gives us an improvement in asymptotic performance, and therefore changes how we can scale up to larger problems.
So far we’ve been concentrating entirely on the first family of techniques; we will get to the second soon!
I intended in this episode to talk a bit about the “use specialized hardware to attack the problem” technique, but I think this episode is long enough for today, so let’s pick up there next time.
Next time on FAIC: I’ll present an implementation submitted by a reader that uses specialized hardware instructions to implement Scholes’ algorithm on a 4-quad.