An inheritance puzzle, part two

Today, the answer to Friday’s puzzle. It prints Int32. But why?

Some readers hypothesized that M would print out Int32 because the declaration B : A<int> somehow tells B that T is to be treated as int, now and forever.

Though the answer is right, the explanation is not quite right. One can illustrate this by taking C out of the picture. If you say (new A<string>.B()).M(), you’ll get String. The fact that B is an A<int> doesn’t make T always int inside B!

The always-keen Stuart Ballard was the first to put his finger upon the real crux of the problem — but he’s seen this problem before, so he had an advantage. Eamon Nerbonne was the first to post a complete and correct explanation.

The really thorny issue here is that the declaration class C : B is, upon close inspection, somewhat ambiguous. Is that equivalent to class C : A<T>.B or to class C : A<int>.B?

Clearly it matters which we choose. A method is a member of a class only if it is declared in that class or in one of its base classes; merely being declared in an outer class doesn’t cut it:

public class X 
{ 
  public void M() { } 
}
public class Y 
{
  public void N() { }
  public class Z : X { }
}
...
// legal, from base class
(new Y.Z()).M(); 
// illegal -- outer class members are not members of inner classes.
(new Y.Z()).N(); 

In our example, when we called M on an instance of A<string>.B.C, it was calling a method of the base class of C. If the base class of C is A<T>.B, then that should call A<T>.B.M, and it should print out whatever the current value of T is — in this case, String. If the base class of C is A<int>.B, then that should call A<int>.B.M, so it should print out Int32.

We choose the latter as the base class. That was certainly a surprise to me. And Stuart Ballard. And, amusingly enough, when I sprang this one upon Anders and didn’t give him time to think about it carefully, it was a surprise to him as well. When I sprang it on Cyrus, he cheerfully pointed out that he had already posted a harder version of this problem back in 2005, the solution of which Stuart characterized back then as “Insanely complex, but it makes perfect sense.” I couldn’t agree more, though at least my version of the puzzle is somewhat simpler.

Anyway, why on earth ought that to be the case? Surely the B in class C : B means the immediately containing class, which is A<T>.B, not A<int>.B! And yet it does not.

These generics are screwing up our intuitions. Let’s look at an example which has no generics at all:

public class D 
{
  public class E {}
}
public class F 
{
  public class E { }
  public class G 
  {
    public E e; // clearly F.E
  }
}
public class H : D 
{
  public E e; // clearly D.E
}

This is all legal so far, and should be pretty clear. When we are binding a name to a type, the type we get is allowed to be a member of any base class or a member of any outer class. But what if we have both to choose from?

public class J 
{
  public class E { }
  public class K : D 
  {
    public E e; // Is this J.E or D.E?
  }
}

We could just throw up our hands and say that this is ambiguous and therefore illegal, but we’d rather not do that if we can avoid it. We have to prefer one of them, and we’ve decided that we will give priority to base classes over outer classes. Derived classes have an “is a kind of” relationship with their base classes, and that is logically a “tighter” binding than the “is contained in” relationship that inner classes have with outer classes.

Another way to think about it is that all the members you get from your base class are all “in the current scope”; therefore all the members you get from outer scopes are given lower priority, since stuff inside inner scopes takes priority over stuff in outer scopes.

The algorithm we use to search for a name used in the context of a type S is as follows:

  • search S’s type parameters
  • search S’s accessible inner classes
  • search accessible inner classes of all of S’s base classes, going in order from most to least derived
  • set S to S’s outer class, and start over

(And if that fails then we invoke the whole mechanism of searching the namespaces that are in scope, checking alias clauses, etc.)
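For concreteness, the search order can be modeled in a few lines of code. This is my own toy sketch, not the real compiler algorithm: each “type” is just a record of its type parameters, accessible inner classes, base class and outer class, and we walk them in the order given above. The model below encodes the J/K/D example from earlier:

```python
# A toy model of the search order above -- a sketch, not the real compiler.
def lookup(s, name):
    while s is not None:
        if name in s["type_params"]:
            return s
        t = s
        while t is not None:            # s itself, then its base classes in order
            if name in t["inner"]:
                return t["inner"][name]
            t = t["base"]
        s = s["outer"]                  # no luck: start over in the outer class
    return None                         # would fall back to namespace lookup

def mk(outer=None, base=None):
    return {"type_params": set(), "inner": {}, "base": base, "outer": outer}

# Model the J / K / D example: K derives from D and is nested inside J,
# and both J and D declare a class named E.
D = mk(); D["inner"]["E"] = ("D", "E")
J = mk(); J["inner"]["E"] = ("J", "E")
K = mk(outer=J, base=D)
J["inner"]["K"] = K

assert lookup(K, "E") == ("D", "E")    # the base class wins over the outer class
```

Run against the model, E in K resolves to D.E, exactly as the prose argues: the base-class chain is exhausted before the outer scope is ever consulted.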

With that in mind, now the solution should finally make some sense. At the point where we are resolving the base class of C we know that C has no type parameters. We do not know what the base class or inner classes of C are — that’s what we’re trying to figure out — so we skip checking them.

The next thing we check is the outer class, which is A<T>.B, but we do NOT say, aha, the outer class is called B, we’re done. That is not at all what the algorithm above says. Instead, it says check A<T>.B to see if it has a type parameter called B or an inner type called B. It does not, so we keep searching.

The base type of A<T>.B is A<int>. The outer type of A<T>.B is A<T>. Both have an accessible inner class called B. Which do we pick? The base type gets searched first, so B resolves to A<int>.B. Obviously.

Having members of base classes bind tighter than members from outer scopes can lead to bizarre situations but they are generally pretty contrived. For example:

public class K 
{ 
  public class L { } 
}
public class L : K 
{
  L myL; // this is K.L!
}

And of course, you can always get around these problems by eliminating the ambiguity:

public class A<T> 
{
  public class B : A<int> 
  {
    public void M() { ... }
    // no longer ambiguous which B is picked.
    public class C : A<T>.B { } 
  }
}

Finally, the specification of this behaviour is a bit tricky to understand. The spec says:

Otherwise, if T contains a nested accessible type having name I and K type parameters, then the namespace-or-type-name refers to that type constructed with the given type arguments. If there is more than one such type, the type declared within the more derived type is selected.

By “if T contains a nested accessible type”, it means “if T, or any of its base classes, contains a nested accessible type”. I completely failed to comprehend that the first n times I read that section. I’ll see if I can get that clarified in the next version of the standard.

An inheritance puzzle, part one

Once more I have returned from my ancestral homeland, after some weeks of sun, rain, storms, wind, calm, friends and family. I could certainly use another few weeks, but it is good to be back too.

Well, enough chit-chat; back to programming language design. Here’s an interesting combination of subclassing with nesting. Before trying it, what do you think this program should output?

public class A<T> 
{
  public class B : A<int> 
  {
    public void M() 
    {
      System.Console.WriteLine(typeof(T).ToString());
    }
    public class C : B { }
  }
}
class MainClass 
{
  static void Main() 
  {
    A<string>.B.C c = new A<string>.B.C();
    c.M();
  }
}

Should this say that T is int, string or something else? Or should this program not compile in the first place?

It turned out that the actual result was not what I was expecting, at any rate. I learn something new about this language every day.

Can you predict the behaviour of the code? Can you justify it according to the specification? (The specification is really quite difficult to understand on this point, but in fact it does all make sense.)

The answer is in the next episode!

Why are overloaded operators always static in C#?

A language design question was posted to the Microsoft internal C# discussion group this morning: “Why must overloaded operators be static in C#? In C++ an overloaded operator can be implemented by a static, instance or virtual method. Is there some reason for this constraint in C#?”

Before I get into the specifics, there is a larger point here worth delving into, or at least linking to. Raymond Chen immediately pointed out that the questioner had it backwards. The design of the C# language is not a subtractive process; though I take Bob Cringely’s rather backhanded compliments from 2001 in the best possible way, C# is not Java/C++/whatever with the kludgy parts removed. Former C# team member Eric Gunnerson wrote a great article about how the process actually works.

Rather, the question we should be asking ourselves when faced with a potential language feature is “does the compelling benefit of the feature justify all the costs?” And costs are considerably more than just the mundane dollar costs of designing, developing, testing, documenting and maintaining a feature. There are more subtle costs, like, will this feature make it more difficult to change the type inferencing algorithm in the future? Does this lead us into a world where we will be unable to make changes without introducing backwards compatibility breaks? And so on.

In this specific case, the compelling benefit is small. If you want to have a virtual dispatched overloaded operator in C# you can build one out of static parts very easily. For example:

public class B {
    public static B operator +(B b1, B b2) { return b1.Add(b2); }
    protected virtual B Add(B b2) { /* … */ }
}

And there you have it. So, the benefits are small. But the costs are large. C++-style instance operators are weird. For example, they break symmetry. If you define an operator+ that takes a C and an int, then c+2 is legal but 2+c is not, and that badly breaks our intuition about how the addition operator should behave.

Similarly, with virtual operators in C++, the left-hand argument is the thing which parameterizes the virtual dispatch. So again, we get this weird asymmetry between the right and left sides. Really what you want for most binary operators is double dispatch — you want the operator to be virtualized on the types of both arguments, not just the left-hand one. But neither C# nor C++ supports double dispatch natively. (Many real-world problems would be solved if we had double dispatch; for one thing, the visitor pattern becomes trivial. My colleague Wes is fond of pointing out that most design patterns are in fact necessary only insofar as the language has failed to provide a needed feature natively.)
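To make “double dispatch built out of static parts” concrete, here is a minimal sketch in Python — all the class and function names are invented for the example — using a table keyed on the types of both operands:

```python
# A sketch of hand-rolled double dispatch: the operation is chosen by the
# types of BOTH operands, not just the left-hand one. All names are invented.
class Animal: pass
class Cat(Animal): pass
class Dog(Animal): pass

def cat_meets_cat(a, b): return "hissing"
def cat_meets_dog(a, b): return "a chase"
def dog_meets_dog(a, b): return "sniffing"

MEET = {
    (Cat, Cat): cat_meets_cat,
    (Cat, Dog): cat_meets_dog,
    (Dog, Cat): lambda a, b: cat_meets_dog(b, a),   # symmetry, by hand
    (Dog, Dog): dog_meets_dog,
}

def meet(a, b):
    # Dispatch on the pair of runtime types, not on the left operand alone.
    return MEET[type(a), type(b)](a, b)

# Unlike a C++ virtual operator, the result is symmetric in its arguments.
assert meet(Cat(), Dog()) == meet(Dog(), Cat()) == "a chase"
```

The table is the whole trick: because the key is the pair of types, neither operand is privileged, which is exactly the symmetry that instance operators break.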

And finally, in C++ you can only define an overloaded operator on a non-pointer type. This means that when you see c+2 and rewrite it as c.operator+(2), you are guaranteed that c is not a null pointer because it is not a pointer! C# also makes a distinction between values and references, but it would be very strange if instance operator overloads were only definable on non-nullable value types, and it would also be strange if c+2 could throw a null reference exception.

These and other difficulties, along with the ease of building your own single (or double!) virtual dispatch mechanisms out of static mechanisms, make it easy to decide not to add instance or virtual operator overloading to C#.

Chained user-defined explicit conversions in C#

Reader Niall asked me why the following code compiles but produces an exception at runtime:

class Base {}
class Derived : Base {}
class Castable 
{
    public static explicit operator Base(Castable c) 
    {
      return new Base(); 
    }
}
// ...
Derived d = (Derived)(new Castable());

It should be clear why this produces an exception at runtime; the user-defined operator returns a Base and that is not assignable to a variable of type Derived. But why does the compiler allow it in the first place?

First off, let’s define the difference between an implicit and an explicit conversion. An implicit conversion is one which the compiler knows can always be done without incurring the risk of a runtime exception. When you have a method int Foo(int i){...} and call it with long l = Foo(myshort);, the compiler inserts implicit conversions from int to long on the return side and from short to int on the call side. There is no int which doesn’t fit into a long and there is no short which doesn’t fit into an int, so we know that the conversions will always succeed, so we just up and do them for you.

There are also conversions which we know at compile time will never succeed. If there is no user-defined conversion from Giraffe to int, then Foo(new Giraffe()) is always going to fail at runtime, so this fails at compile time.

An explicit conversion is a conversion which might succeed sometimes but might also fail. We cannot disallow it, because it might succeed, but we can’t go silently inserting one either, since it might fail unexpectedly. We need to force the developer to acknowledge that risk explicitly. If you called ulong ul = Foo(mynullableint); then that might fail, so the compiler requires you to spell out that the conversions are explicit. The assignment could be written ulong ul = (ulong)Foo((int)mynullableint);.

There are two times that the compiler will insert an explicit cast for you without producing a warning or error. The first is the case above. When a user-defined explicit cast requires an explicit conversion on either the call side or the return side, the compiler will insert the explicit conversions as needed. The compiler figures that if the developer put the explicit cast in the code in the first place then the developer knew what they were doing and took the risk that any of the conversions might fail. That’s what the cast means: this conversion might fail, I will deal with it.

I understand that this puts a burden upon the developer to fully understand the implications of a cast, but the alternative is to make you spell it out even further, and it just gets to be too much. The logical extreme of this would be a case such as

public struct S
{
    public static explicit operator decimal?(S s) { return 1.0m; }
}
//...
S? s = new S();
int i = (int) s;

Here we do first an explicit conversion from S? to S, then a user-defined explicit conversion from S to decimal?, then an explicit conversion from decimal? to decimal, and then an explicit conversion from decimal to int. That’s four explicit conversions for the price of one cast, which I think is pretty good value for your money.

I want to note at this point that this is as long as the chain gets. A user-defined conversion can have built-in conversions inserted automatically on the call and return sides, but we never automatically insert other user-defined conversions. We never say that there’s a user-defined conversion from Alpha to Bravo, and a user-defined conversion from Bravo to Charlie, and therefore casting an Alpha to a Charlie is legal. That doesn’t fly.

And a built-in conversion can be lifted to nullable, which may introduce additional conversions as in the case above. But again, these are never user-defined.

The second is in the foreach loop. If you have foreach(Giraffe g in myAnimals) then we generate code which fetches each member of the collection and does an “explicit” conversion to Giraffe. If there happens to be a Snail or a WaterBuffalo in myAnimals, that’s a runtime exception. I considered adding a warning to the compiler for this case to say hey, be aware that your collection is of a more general type, this could fail at runtime. It turns out that there are so many programs which use this programming style, and so many of them have “compile with warnings as errors” turned on in their builds, that this would be a huge breaking change. So we opted to not do it.

Five-Dollar Words for Programmers, Part One: Idempotence

Programmers, particularly those with a mathematical background, often use words from mathematics when describing their systems. Unfortunately, they also often do so without consideration of the non-mathematical background of their colleagues. I thought I might talk today a bit about the word “idempotent”. This is a very easy concept but lots of people don’t know that there is a word for it.

There are two closely related mathematical definitions for “idempotent”. First, a value is “idempotent under function foo” if the result of doing foo to the value is the value right back again. Another common way to say this is that the value is a fixpoint of foo.

Second, a function is “idempotent” if the result of doing it twice — where we are feeding the output of the first call into the second call — is exactly the same as the result of doing it once. Or, in other words, every output of the function is idempotent under it. Or in even more other words, every output of an idempotent function is a fixpoint of the function.

The most trivial mathematical example of the second kind is the constant function. If f(x) = c, then clearly f(x) = f(f(x)) = f(f(f(x))) … so f is idempotent (and the constant is idempotent under it).

The identity function f(x) = x is also idempotent; I’m sure you see why.

The function that takes a finite set of numbers and returns a set containing only its largest element is idempotent (and every one-element set is idempotent under it). I’m sure you can think of lots of examples of idempotent functions.

To get a handle on the other sense, pick an operation — say, doubling. The only value which is idempotent under that operation is zero. The operation “subtracting any non-zero value” has no idempotent values. Squaring has two idempotent values, zero and one.
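The set-of-numbers example above is easy to put in code; a quick sketch in Python:

```python
# keep_largest is idempotent, and every one-element set is idempotent
# under it -- that is, every one-element set is a fixpoint of it.
def keep_largest(numbers):
    return {max(numbers)}

s = {3, 1, 4, 1, 5}
assert keep_largest(keep_largest(s)) == keep_largest(s) == {5}
assert keep_largest({5}) == {5}      # {5} is a fixpoint of keep_largest
```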

The way programmers use “idempotent” is slightly different, but this concept comes up all the time in practical programming, particularly around caching logic. Usually when used in the computer science sense we don’t mean that the effect of the function is invariant under composition of the function with itself, but rather that it is invariant over any number of calls greater than one. For example, I don’t know how many times I’ve written:

HRESULT GetTypeLibCreator(ICreateTypeLib2 ** ppctl) 
{
  if (this->m_pctl == NULL) 
  {
    HRESULT hr; 
    hr = CreateTypeLib2(SYS_WIN32, pszName, &this->m_pctl);
    if (FAILED(hr)) 
      return hr;
  }
  *ppctl = this->m_pctl;
  this->m_pctl->AddRef();
  return S_OK;
}

A nice little idempotent function — calling it two, three, or any other number of times with the same arguments has exactly the same result as calling it once.
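The same create-on-first-call pattern works in any language; here is a little Python analogue — the names are made up, and object() stands in for the expensive-to-create type library:

```python
# A Python analogue of the caching pattern above. Names are invented;
# object() stands in for the expensive resource.
class TypeLibHolder:
    def __init__(self):
        self._lib = None

    def get(self):
        if self._lib is None:       # do the expensive creation once...
            self._lib = object()
        return self._lib            # ...and return the same result every time

holder = TypeLibHolder()
assert holder.get() is holder.get() is holder.get()
```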

The place you see the other characterization of idempotence all the time is in C++ header files. Include a needed header zero times and you’ll get “not defined” errors. Accidentally include it twice and you’ll get “redefinition” errors. It’s a major pain to make sure that every header file is included exactly once. Therefore, most headers use some trick to make them idempotent under the inclusion operation:

#ifndef STUFF_H_INCLUDED
#define STUFF_H_INCLUDED

// headers here

#endif // STUFF_H_INCLUDED

or in more modern systems, the “#pragma once” directive makes headers idempotent under inclusion.

Floating Point Arithmetic, Part One

A month ago I was discussing some of the issues in integer arithmetic, and I said that issues in floating point arithmetic were a good subject for another day. Over the weekend I got some questions from a reader about floating point arithmetic, so this seems like as good a time as any to address them.

Before I talk about some of the things that can go terribly wrong with floating point arithmetic, it’s helpful (and character building) to understand how exactly a floating point number is represented internally.

To distinguish between decimal and binary numbers, I’m going to do all binary numbers in fixed-width.

Here’s how floating point numbers work. A float is 64 bits. Of that, one bit represents the sign: 0 is positive, 1 is negative.

Eleven bits represent the exponent. To determine the exponent value, treat the exponent field as an eleven-bit unsigned integer, then subtract 1023. However, note that the exponent fields 00000000000 and 11111111111 have special meaning, which we’ll come to later.

The remaining 52 bits represent the mantissa.

To compute the value of a float, here’s what you do.

  • Write out all 52 bits of the mantissa.
  • Stick a 1. onto the left hand side.
  • Compute that value as a 53-bit binary fraction with 52 fractional places.
  • Multiply that by two to the power of the given exponent value.
  • Sign it appropriately.

So for example, the number -5.5 is represented like this: (sign, exponent, mantissa)

(1, 10000000001, 0110000000000000000000000000000000000000000000000000)

The sign is 1, so it is a negative number.

The exponent is 1025 – 1023 = 2.

Put a 1. on the top of the mantissa and you get

1.0110000000000000000000000000000000000000000000000000

which in decimal is 1.375 and sure enough, -1.375 x 2^2 = -5.5
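If you want to check this layout for yourself, here is a quick sketch in Python (whose floats are this same 64-bit format) that pulls the three fields out of the bits of -5.5:

```python
import struct

def decode(x):
    # View the 64 bits of the double as one big-endian integer.
    bits = int.from_bytes(struct.pack('>d', x), 'big')
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF     # the eleven exponent bits
    mantissa = bits & ((1 << 52) - 1)   # the 52 mantissa bits
    return sign, exponent, mantissa

sign, exponent, mantissa = decode(-5.5)
assert sign == 1                         # negative
assert exponent - 1023 == 2              # 10000000001 is 1025; 1025 - 1023 = 2
assert 1 + mantissa / 2**52 == 1.375     # stick the 1. on and you get 1.0110...
assert (-1)**sign * 1.375 * 2**2 == -5.5
```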

This system is nice because it means that every number in the range of a float has a unique representation, and therefore doesn’t waste bits on duplicates.

However, you might be wondering how zero is represented, since every bit pattern has 1. plunked onto the left side. That’s where the special values for the exponent come in.

If the exponent is 00000000000, then the float is considered a “denormal”. It gets 0. plunked onto the left side, not 1., and the exponent is assumed to be -1022.

This has the nice property that if all bits in the float are zero, it is representing zero.

Note that this lets you represent smaller numbers than you would be able to otherwise, as we’ll see, though you pay the price of lower precision. Essentially, denormals exist so that the chip can do “graceful underflow” — represent tiny values without having to go straight to zero.

If the exponent is 11111111111 and the mantissa is all zeros, that’s Infinity.

If the exponent is 11111111111 and the mantissa is not all zeros, that’s considered to be Not A Number — this is a bit pattern reserved for errors.

So the biggest positive normalized float is

(0, 11111111110, 1111111111111111111111111111111111111111111111111111)

which is

1.1111111111111111111111111111111111111111111111111111 x 2^1023.

The smallest positive normalized float is

(0, 00000000001, 0000000000000000000000000000000000000000000000000000)

which is

1.000 x 2^-1022

The biggest positive denormalized float is:

(0, 00000000000, 1111111111111111111111111111111111111111111111111111)

which is

0.1111111111111111111111111111111111111111111111111111 x 2^-1022

The smallest positive denormalized float is

(0, 00000000000, 0000000000000000000000000000000000000000000000000001)

which is

0.0000000000000000000000000000000000000000000000000001 x 2^-1022 = 2^-1074
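These limits are easy to verify; a quick check in Python, whose floats are this same 64-bit format:

```python
import sys

# The smallest positive denormal really is 2^-1074; halve it and you hit zero.
smallest = 2.0 ** -1074
assert smallest > 0.0
assert smallest / 2 == 0.0              # underflows to zero: nothing is smaller

# The smallest positive normalized float is 2^-1022 ...
assert sys.float_info.min == 2.0 ** -1022
# ... and the largest is 1.111...1 x 2^1023, i.e. (2 - 2^-52) x 2^1023.
assert sys.float_info.max == (2 - 2 ** -52) * 2.0 ** 1023
```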

Next time on FAIC: floating point math is nothing like real number math.

250% of What, Exactly?

I just got a question this morning about how to take two collections of items and determine how many of those items had the same name. The user had written this straightforward but extremely slow VBScript algorithm:

For Each Frog In Frogs
  For Each Toad In Toads
    If Frog.Name = Toad.Name Then
      SameName = SameName + 1
      Exit For
    End If
  Next
Next

There were about 5000 frogs and 3000 toads, and 1500 of them had the same name. Every one of the 3500 unsuccessful searches checked all 3000 toads, and the 1500 successful searches checked 1500 toads each on average. Each pass through the inner loop costs one loop iteration, two calls to the Name property, and one string comparison. Add all those up and you get roughly 50 million function calls to determine this count.
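As a sanity check on that arithmetic, using the simplified four-operations-per-iteration accounting above:

```python
frogs, toads, matched = 5000, 3000, 1500

# Unsuccessful searches scan all 3000 toads; successful ones scan 1500 on average.
iterations = (frogs - matched) * toads + matched * (toads // 2)
calls = iterations * 4   # loop step + two Name calls + one string comparison

assert iterations == 12_750_000
assert calls == 51_000_000           # "roughly 50 million" checks out
```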

This code has been somewhat simplified – the actual user code was doing more work inside the loop, including calls to WMI objects. The whole thing was taking 10+ minutes, which is actually pretty fast considering how much work was being done. Each individual function call was only taking a few microseconds, but fifty million calls adds up!

Now, we’ve been down this road before in this blog (here, here and here) and so of course I recommended building a faster lookup table rather than doing a full search through the collection every time.

Set FrogLookup = CreateObject("Scripting.Dictionary")
For Each Frog In Frogs
  Set FrogLookup(Frog.Name) = Frog
Next
For Each Toad In Toads
  If FrogLookup.Exists(Toad.Name) Then SameName = SameName + 1
Next

Which is much, much faster. That’s only about 16 thousand function calls. Now, this is maybe not an apples-to-apples comparison of function calls, but we at least have good reason to believe that this is going to be several times faster. 

And indeed it was. But I’m still not sure how much, which brings me to the actual subject of today’s blog. The user reported that the new algorithm was “250% better”. I hear results like this all the time, and I always have to ask to clarify what it means. You can’t just say “n% better” without somehow also communicating what you’re measuring.


(UPDATE: This reported result understandably confused some readers.  Clearly the new loop here is thousands of times faster, not a mere 250% faster. As I said before, the sketch above is highly simplified code. The real-world code included many thousands of additional calls to WMI objects which were not eliminated by this optimization. Eliminating these 50 million function calls helped — you should always eliminate the slowest thing first!  But doing so also exposed a new “hot spot” that needed further optimization.  However, the point of this article is not the benefits of using lookup tables, but rather that using unexplained percentages to report performance results is potentially misleading.  The result above is merely illustrative.  See the comments for more details.)


Allow me to illustrate. Suppose I have a web server that is serving up ten thousand pages every ten seconds. I make a performance improvement to the page generator so that it is now serving up fifteen thousand pages every ten seconds. I can sensibly say:

  • performance has improved by 50%, because we are now serving up 5000 more pages every ten seconds, and 5000 is 50% of the original 10000. In this world, any positive percentage is good.
  • performance is now 150% of original performance because 15000 is 150% of 10000. In this world, 0%-100% worse or the same, 100%+ is good.
  • We’ve gone from 1000 microseconds per page to 667 microseconds per page, saving 333 microseconds per page. 333 is 33% of 1000, so we’ve got a 33% performance improvement.  In this world, 0% is bad, 100% is perfect,  more than 100% is nonsensical.
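All three readings fall out of the same two measurements; as a quick check:

```python
before, after = 10_000, 15_000        # pages per ten seconds
ten_seconds_us = 10_000_000           # ten seconds, in microseconds

improvement = (after - before) / before * 100   # "50% better"
ratio = after / before * 100                    # "150% of original"

before_us = ten_seconds_us / before   # 1000 microseconds per page
after_us = ten_seconds_us / after     # about 667 microseconds per page
time_saved = (before_us - after_us) / before_us * 100   # "33% improvement"

assert improvement == 50.0
assert ratio == 150.0
assert round(time_saved) == 33
```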

If I can sensibly say that performance is better by 50%, 150% and 33%, all referring to exactly the same improvement, then I cannot actually be communicating any fact to my listeners! They need to know whether I’m measuring speed or time, and if speed, whether I’m comparing original speed to new speed or original speed to the difference.

So what is “250%” better? Clearly not raw timings. If this loop took 10 minutes to run before, 250% of 10 minutes is 25 minutes, but surely it is not running in -15 minutes now! I assume that the measurement is speed — say, number of loops runnable per hour. If before it took ten minutes and therefore we could run this loop six times per hour, and now it is 250% better, 250% of 6 is 15, and this is “better”, so we need to add. So that’s 21 loops per hour, or about 3 minutes per loop. Had the user accidentally said “250% faster” but meant to say “250% of previous performance” then we’d conclude that 250% of 6 per hour is 15 per hour, so now we’ve got 4 minutes per loop.
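Working both readings through as arithmetic, with speed measured in loops per hour:

```python
before = 6                       # loops per hour: one loop per ten minutes

# "250% better": add 250% of the old speed to the old speed.
better = before + 2.5 * before
assert better == 21
assert round(60 / better, 1) == 2.9     # about three minutes per loop

# "250% of previous performance": the new speed IS 250% of the old.
of_previous = 2.5 * before
assert of_previous == 15
assert 60 / of_previous == 4.0          # four minutes per loop
```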

And of course, this is assuming that the person reporting the improvement is actually trying to communicate facts to the audience. Be wary! Since 33%, 50% and 150% are all sensible ways to report this improvement, which do you think someone who wants to impress you is going to choose? There are opportunities here for what Kee Dewdney calls “percentage pumping” and other shyster tricks for making weak numbers look more impressive. Professor Dewdney’s book on the subject, “200% of Nothing”, is quite entertaining.

The moral of the story is: please do not report performance improvements in percentages. Report performance improvements in terms of raw numbers with units attached, such as “microseconds per call, before and after change”, or “pages per second, before and after change”. That gives the reader enough information to actually understand the result.

(And the other moral is, of course, lookup tables are your friends.)

High-Dimensional Spaces Are Counterintuitive, Part Five

All of this stuff about high dimensional geometry has been known for quite a while. The novel bit that this paper is all about is how to actually build a fast index for searching a high-dimensional space where the query has a strong likelihood of being junk. The idea that these smart Microsoft researchers came up with is called Redundant Bit Vectors, and it goes like this:

Suppose we’ve got a 100 dimensional space of 8-byte floats, with one million records to search. Each record point has a “tolerance hypercube” constructed around it — if a query point falls into that cube, then the record is a likely candidate for a match. Before we do any searches we build an index – building the index can take as long as we like so long as searching it is fast and the index is small.

To build the index, divide every dimension up into non-overlapping “buckets”. The algorithm for choosing the right number of buckets and the best bucket sizes isn’t particularly interesting, so without loss of generality let’s just assume that you’ve got four equally sized buckets for each dimension and move on. We’ve got 100 dimensions, four buckets per dimension, that’s 400 buckets. Now associate each bucket with a one-million-bit integer. That’s a big integer – what integer do we pick?

Consider the third bucket for the 57th dimension. There’s a million-bit integer associated with that bucket. Set the nth bit of the integer to 1 if the nth database record’s tolerance hypercube overlaps any part of the third bucket, considering only the 57th dimension, 0 if it does not. That is, this enormous integer answers the question “if the 57th dimension of the query is in the third bucket, what are all the database entries that we can immediately eliminate?” The ones that we can eliminate have zeros in their position in the integer.

Once we build those 400 million-bit integers, doing a query is a snap. First, figure out which 100 buckets the query point is in, one bucket per dimension. That gives us 100 one-million-bit integers. Binary AND those 100 integers together, and the million-bit integer that results has bits on only for those points which have tolerance hypercubes that overlap the query bucket in all 100 dimensions.
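The whole scheme fits in a few lines of code. Here is a toy sketch in Python with the numbers scaled way down (5 dimensions, 1000 records, and a made-up tolerance, standing in for 100 dimensions and a million records), using Python’s arbitrary-precision integers as the bit vectors:

```python
# A toy sketch of Redundant Bit Vectors with small, invented parameters.
import random

random.seed(0)
DIMS, BUCKETS, N = 5, 4, 1000    # stand-ins for 100 dimensions, 1M records
TOL = 0.05                       # half-width of each tolerance hypercube

records = [[random.random() for _ in range(DIMS)] for _ in range(N)]

def bucket(v):
    return min(int(v * BUCKETS), BUCKETS - 1)

# index[d][b] is an N-bit integer: bit n is set iff record n's tolerance
# interval overlaps bucket b, considering only dimension d.
index = [[0] * BUCKETS for _ in range(DIMS)]
for n, rec in enumerate(records):
    for d, v in enumerate(rec):
        lo, hi = bucket(max(v - TOL, 0.0)), bucket(min(v + TOL, 1.0))
        for b in range(lo, hi + 1):
            index[d][b] |= 1 << n

def candidates(query):
    surviving = (1 << N) - 1             # start with every record's bit on
    for d, v in enumerate(query):
        surviving &= index[d][bucket(v)] # each AND eliminates more records
    return [n for n in range(N) if surviving >> n & 1]

# A query sitting exactly on a record must keep that record as a candidate,
# and the filter should discard the vast majority of the others.
assert 0 in candidates(records[0])
assert len(candidates(records[0])) < N // 10
```

The handful of survivors would then be checked with the real Cartesian distance metric, exactly as the article describes.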

If the bucket sizes are chosen well then that should be a tiny number of points. Ideally, each of the four buckets per dimension would match only 25% of the points, but that’s not likely – maybe the tolerance hypercubes are large enough and the dimensional data are interrelated enough that each bucket only eliminates 50% of the points. But even if that sub-optimal situation arises, each AND eliminates another 50%.

The first million-bit integer for the first dimension gets us down to 500000 candidate points, the second down to 250000, the third 125000, and by the time we get to the twentieth dimension we should be down to only a handful of points, if any. Detecting junk is very fast as the result will quickly go to all zeros. If it isn’t junk and you discover a handful of match candidates then you can pull out the old Cartesian distance metric to check whether the query point is really a good match for any of them.

Obviously there are no chips that can AND together one hundred integers of a million bits each – most chips do it in 32 or 64 bit chunks, two at a time. But by being clever about how the bits are stored you can ensure good cache locality, so that the chip is almost always ANDing together portions of the integers which are close together in memory. And by being clever about keeping track of when there are huge long runs of zeros in the result as it is being computed, you can make the ANDing process faster by skipping regions that you know are going to go to zero.

Is this index small enough to keep in main memory? 400 million-bit integers is 50 MB. That’s a big chunk of memory, but certainly more reasonable than keeping the entire 800 MB database in memory.

Note that this is still a linear-time algorithm – if we’ve got n records then we’ve got to work with an n-bit integer, and that will take O(n) time. But because AND is such a fast operation compared to loading records into memory and computing Cartesian distances, in practice this is a winning algorithm.

Well, that was entertaining. Next time we’ll get back down to earth — I’ll answer a few recent and not-so-recent questions about scripting.

High-Dimensional Spaces Are Counterintuitive, Part Four

It’s reasonably common to have a database containing a few thousand or a few million points in some high-dimensional space. If you have some “query point” in that space you might like to know whether there is a match in your database.

If you’re looking for an exact match then the problem is pretty easy — you just come up with a suitable hash algorithm that hashes points in your space and build a big old hash table. That’s extremely fast lookup. But what if you’re not looking for an exact match, you’re looking for a close match? Perhaps there is some error, either in the measurements that went into the database or the measurement of the query point, that needs to be accounted for.
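For the exact-match case, the sketch is almost trivial: Python's built-in dict plays the role of the hash table, with the tuple of coordinates as the key. The two sample points are made up for illustration.

```python
# Exact-match lookup: hash each point's coordinate tuple into a table.
database = {}
for record_id, point in enumerate([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)]):
    database[point] = record_id

def exact_match(query_point):
    """Constant-time (expected) lookup; returns None when there is no exact match."""
    return database.get(query_point)
```

The trouble, as the next paragraph says, is that the tiniest measurement error changes the hash completely, so this only works when "match" means bit-for-bit equality.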

Now the problem is “for a given query point, what is the closest point in the database to it?” That’s not a problem that’s amenable to hashing. (Well, you could use a Locality Sensitive Hash algorithm, but we’ll rule that out later.)

What if the closest point to the query point is so far away from the query point that it’s more likely that the query point simply has no match at all in the database? In that case we don’t really want the closest point, because that’s not really relevant.

Essentially you can think of every point in the database as being surrounded by a “probability sphere”. As you move farther away from a given point in the database, the probability that you have a match to that point gets smaller and smaller. Eventually it’s small enough that the probability that the query point is “junk” — not a match to any point in the database at all — gets larger than the probability that the closest point is a match.

To sum up the story so far: we’ve got a query point, which may or may not correspond to a point in our database of points. We need a way to say “is this point junk? If not, what are some reasonably close points in the database that might be matches?”

Here’s an idea for an algorithm: compute the distance between the query point and every point in the database. Discard all points where the distance is outside of a certain tolerance. Geometrically, we’re constructing a sphere of a certain radius around each point in the database. We check to see whether the query point is inside each sphere.
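As a sketch of that brute-force algorithm (the radius and the 2-D sample points are invented for illustration):

```python
import math

def matches_within(query, points, radius):
    """Return indices of database points whose sphere of the given radius
    contains the query point, i.e. whose Cartesian distance to it is <= radius."""
    hits = []
    for i, p in enumerate(points):
        dist = math.sqrt(sum((q - x) ** 2 for q, x in zip(query, p)))
        if dist <= radius:
            hits.append(i)
    return hits
```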

We know that the volume of an n-sphere is tiny. That’s good. If we do get a match then we know that the match is probably in a very small volume and therefore likely to be correct. However, we also know that such a system is not tolerant of small errors in many dimensions — because distances grow so fast, a small deviation in many dimensions leads to a big distance. That means that we probably will have to construct a sphere with a fairly large radius in order to allow for measurement error in many dimensions.

But that’s not really so bad. What’s bad about this scheme is that if there are a million points in the database then we have to calculate one million Cartesian distances for every query. In a 100-dimensional space that means computing 100 differences, 100 multiplications, 100 additions and one comparison a million times just to do one query.

We could build in some optimizations — at each addition we could check whether the running total exceeds the squared tolerance and discard the point immediately. Maybe we could cut it down to on average 50 differences, multiplications and additions per point, but then we’ve just added in 50 comparison operations. No matter how you slice it, we’re doing a lot of math.
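That early-out check might look like this; comparing squared distances keeps the per-dimension test to a single comparison and avoids the square root entirely (a sketch, not the post's code):

```python
def within_tolerance(query, point, radius):
    """True if point is within radius of query. Bails out as soon as the
    running sum of squared differences exceeds radius squared, so a bad
    point is often rejected after only a few dimensions."""
    limit = radius * radius
    total = 0.0
    for q, x in zip(query, point):
        total += (q - x) ** 2
        if total > limit:
            return False  # already too far; skip the remaining dimensions
    return True
```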

A hash table works by automatically discarding all but a tiny number of points, so that you just have to check a small number rather than the whole database. A Locality Sensitive Hash algorithm (which I might write about in another series) would work well here except that LSH algorithms have lousy performance if a large percentage of the queries are junk. Let’s assume that there are going to be a lot of junk queries. Is there some way that we can more rapidly find valid points?

We learned earlier that 99.9% of the volume of an n-sphere is contained by an n-cube with the same center as the sphere and an edge length fairly close to that of the radius.

Hmm.

Determining if a point is inside a cube is a lot less math than determining if a point is inside a sphere. With a cube you don’t need to compute any distance. You just compare the position of the point in each dimension against that dimension’s upper and lower boundaries and see whether it falls between them. For a 100-dimensional space, that’s 100 to 200 comparisons per point, and as soon as even one of the dimensions is bad, you can skip this cube and move on.
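The cube test is just per-dimension bounds comparisons with an early exit. A sketch (the half-edge parameter is my naming, not the post's):

```python
def inside_cube(query, center, half_edge):
    """True if query lies inside the axis-aligned cube of edge 2*half_edge
    centered at center. At most two comparisons per dimension, and the
    first out-of-bounds dimension ends the test immediately."""
    for q, c in zip(query, center):
        if q < c - half_edge or q > c + half_edge:
            return False  # one bad dimension is enough to reject the cube
    return True
```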

Maybe we can get it down to around a few dozen comparisons per point for a straightforward linear search. That’s pretty good compared to the distance metric.

We know that a cube of a given side is much, much more voluminous than a sphere of similar radius. We don’t really care about the 0.1% false negative rate caused by clipping off the bits of the sphere that are close to the axes. But what about the false positive rate of the huge volume of points that are inside the cube but not inside the sphere? These are easily dealt with: once we have winnowed the likely matches down to a small subset through the cheap cube method, we can do our more expensive spherical check on the small number of remaining candidates to eliminate false positives.
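The two-stage winnow can be sketched end to end: the cheap cube test first, then the exact distance check only on the survivors. The radius, cube size, and sample points are all invented for illustration.

```python
def two_stage_matches(query, points, radius, cube_half_edge):
    """Cheap per-dimension cube filter first; exact (expensive) squared-distance
    check on the few survivors to eliminate false positives."""
    candidates = []
    for i, p in enumerate(points):
        if all(abs(q - x) <= cube_half_edge for q, x in zip(query, p)):
            candidates.append(i)  # survived the cheap bounds comparisons
    limit = radius * radius
    hits = []
    for i in candidates:
        if sum((q - x) ** 2 for q, x in zip(query, points[i])) <= limit:
            hits.append(i)  # confirmed by the exact distance check
    return hits
```

In the test below, the point at (0.9, 0.9) is inside the unit cube around the query but outside the unit sphere, so the second stage correctly throws it out.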

By this point your bogosity sense should be tingling.

This armchair performance analysis is completely bogus. Yes, ~70 comparisons is cheaper than ~100 subtractions, multiplications and additions. Who cares? That’s not the most expensive thing. Performance analysis always has to be about what the most expensive thing is.

Think about this database for a moment — a million records, each record is a 100-dimensional point in some space. Let’s suppose for the sake of argument that it’s a space of 8-byte double-precision floats. That’s 800 megabytes of memory. Now, there are certainly database servers out there that can keep an 800 meg file in physical memory, but we’re clearly pushing the envelope here. What if the database is large enough that there isn’t enough physical memory to keep the whole thing in all at once? At that point, the cost of iterating over all one million records becomes the cost of swapping big chunks of the database to disk and back.

Now consider what happens if there are multiple threads doing searches at the same time. Unless you can synchronize them somehow so that they all use the same bits of the database at the same time, you’re just going to exacerbate the swapping problem as different threads access different parts of the record set. (And if you can synchronize them like that, why even have multiple threads in the first place?)

The fundamental problem with both the sphere and the cube methods isn’t that the math per record is expensive, it’s that they must consider every record in the database every time you do a query. Getting those records into memory is the expensive part, not the math.

What we really need is some index, small enough to be kept in memory, that quickly eliminates lots of the records from consideration in the first place. Once we’re down to just a few candidates, they can be pulled into main memory and checked using as computationally intensive an algorithm as we like.

There’s no obvious way to build an index of hyperspheres, but there might be things we can do with hypercubes. Stay tuned.