High-Dimensional Spaces Are Counterintuitive, Part Five

All of this stuff about high-dimensional geometry has been known for quite a while. The novel bit that this paper is all about is how to actually build a fast index for searching a high-dimensional space where the query has a strong likelihood of being junk. The idea that these smart Microsoft researchers came up with is called Redundant Bit Vectors, and it goes like this:

Suppose we’ve got a 100-dimensional space of 8-byte floats, with one million records to search. Each record point has a “tolerance hypercube” constructed around it — if a query point falls into that cube, then the record is a likely candidate for a match. Before we do any searches we build an index; building the index can take as long as we like, so long as searching it is fast and the index is small.

To build the index, divide every dimension up into non-overlapping “buckets”. The algorithm for choosing the right number of buckets and the best bucket sizes isn’t particularly interesting, so without loss of generality let’s just assume that you’ve got four equally sized buckets for each dimension and move on. We’ve got 100 dimensions, four buckets per dimension, that’s 400 buckets. Now associate each bucket with a one-million-bit integer. That’s a big integer – what integer do we pick?

Consider the third bucket for the 57th dimension. There’s a million-bit integer associated with that bucket. Set the nth bit of the integer to 1 if the nth database record’s tolerance hypercube overlaps any part of the third bucket, considering only the 57th dimension, 0 if it does not. That is, this enormous integer answers the question “if the 57th dimension of the query is in the third bucket, what are all the database entries that we can immediately eliminate?” The ones that we can eliminate have zeros in their position in the integer.
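
To make this concrete, here is a little sketch in Python of what building the index might look like. The names (records, tolerance, lo, hi) are made up for illustration, I’m assuming the tolerance hypercube has the same half-width in every dimension and that every dimension is split into four equal buckets, and Python’s arbitrary-precision integers stand in for the million-bit integers.

    def build_index(records, tolerance, lo, hi, buckets_per_dim=4):
        dims = len(lo)
        # index[d][b] is the "million-bit integer" for bucket b of dimension d.
        index = [[0] * buckets_per_dim for _ in range(dims)]
        for n, record in enumerate(records):
            bit = 1 << n
            for d in range(dims):
                width = (hi[d] - lo[d]) / buckets_per_dim
                for b in range(buckets_per_dim):
                    bucket_lo = lo[d] + b * width
                    bucket_hi = bucket_lo + width
                    # Set bit n if record n's tolerance interval in this
                    # dimension overlaps any part of this bucket.
                    if record[d] - tolerance <= bucket_hi and record[d] + tolerance >= bucket_lo:
                        index[d][b] |= bit
        return index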

Once we build those 400 million-bit integers, doing a query is a snap. First, figure out which 100 buckets the query point is in, one bucket per dimension. That gives us 100 one-million-bit integers. Binary AND those 100 integers together, and the million-bit integer that results has bits on only for those points which have tolerance hypercubes that overlap the query bucket in all 100 dimensions.
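
And the query side, under the same made-up assumptions, might look like this:

    def query(index, point, lo, hi, buckets_per_dim=4):
        result = None
        for d, x in enumerate(point):
            width = (hi[d] - lo[d]) / buckets_per_dim
            # Which bucket does the query fall into along this dimension?
            b = max(0, min(buckets_per_dim - 1, int((x - lo[d]) / width)))
            big = index[d][b]
            result = big if result is None else result & big
            if result == 0:
                return []  # junk: nothing survives the ANDs so far
        # The positions of the remaining 1 bits are the candidate records.
        return [n for n in range(result.bit_length()) if (result >> n) & 1]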

If the bucket sizes are chosen well then that should be a tiny number of points. Ideally, each of the four buckets per dimension would match only 25% of the points, but that’s not likely – maybe the tolerance hypercubes are large enough and the dimensional data are interrelated enough that each bucket only eliminates 50% of the points. But even if that sub-optimal situation arises, each AND eliminates another 50%.

The first million-bit integer for the first dimension gets us down to 500,000 candidate points, the second down to 250,000, the third to 125,000, and by the time we get to the twentieth dimension we should be down to only a handful of points, if any. Detecting junk is very fast, as the result will quickly go to all zeros. If it isn’t junk and you discover a handful of match candidates, then you can pull out the old Cartesian distance metric to check whether the query point is really a good match for any of them.
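
A quick sanity check on that arithmetic:

    # A million candidates, and twenty ANDs that each eliminate half of the
    # survivors, leaves less than one expected candidate on average.
    print(1_000_000 * 0.5 ** 20)  # about 0.95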

Obviously there are no chips that can AND together one hundred integers of a million bits each – most chips do it in 32- or 64-bit chunks, two at a time. But by being clever about how the bits are stored you can ensure good cache locality, so that the chip is almost always ANDing together portions of the integers which are close together in memory. And by being clever about keeping track of when there are huge long runs of zeros in the result as it is being computed, you can make the ANDing process faster by skipping regions that you know are going to go to zero.
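
Here is roughly how that zero-skipping trick might look if each million-bit integer were stored as a list of 64-bit words (15,625 of them for a million bits); once a word of the running result goes to zero it can never become nonzero again, so we simply stop visiting it:

    def and_all(vectors):
        # Each vector is a list of 64-bit words representing one big integer.
        result = list(vectors[0])
        live = [i for i, word in enumerate(result) if word != 0]
        for vec in vectors[1:]:
            still_live = []
            for i in live:
                result[i] &= vec[i]
                if result[i] != 0:
                    still_live.append(i)
            live = still_live
            if not live:
                break  # everything has gone to zero: a junk query
        return result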

Is this index small enough to keep in main memory? Four hundred integers of a million bits each come to 400 million bits, which is 50 MB. That’s a big chunk of memory, but certainly more reasonable than keeping the entire 800 MB database in memory.

Note that this is still a linear-time algorithm – if we’ve got n records then we’ve got to work with an n-bit integer, and that will take O(n) time. But because AND is such a fast operation compared to loading records into memory and computing Cartesian distances, in practice this is a winning algorithm.

Well, that was entertaining. Next time we’ll get back down to earth — I’ll answer a few recent and not-so-recent questions about scripting.

High-Dimensional Spaces Are Counterintuitive, Part Four

It’s reasonably common to have a database containing a few thousand or million points in some high-dimensional space. If you have some “query point” in that space you might like to know whether there is a match in your database.

If you’re looking for an exact match then the problem is pretty easy — you just come up with a suitable hash algorithm that hashes points in your space and build a big old hash table. That’s extremely fast lookup. But what if you’re not looking for an exact match, you’re looking for a close match? Perhaps there is some error, either in the measurements that went into the database or the measurement of the query point, that needs to be accounted for.

Now the problem is “for a given query point, what is the closest point in the database to it?” That’s not a problem that’s amenable to hashing. (Well, you could use a Locality Sensitive Hash algorithm, but we’ll rule that out later.)

What if the closest point to the query point is so far away from the query point that it’s more likely that the query point simply has no match at all in the database? In that case we don’t really want the closest point, because that’s not really relevant.

Essentially you can think of every point in the database as being surrounded by a “probability sphere”. As you move farther away from a given point in the database, the probability that you have a match to that point gets smaller and smaller. Eventually it’s small enough that the probability that the query point is “junk” — not a match to any point in the database at all — gets larger than the probability that the closest point is a match.

To sum up the story so far: we’ve got a query point, which may or may not correspond to a point in our database of points. We need a way to say “is this point junk? If not, what are some reasonably close points in the database that might be matches?”

Here’s an idea for an algorithm: compute the distance between the query point and every point in the database. Discard all points where the distance is outside of a certain tolerance. Geometrically, we’re constructing a sphere of a certain radius around each point in the database. We check to see whether the query point is inside each sphere.
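
As a rough sketch in Python, assuming database is a list of points (each a list of floats) and query is a point with the same number of dimensions:

    import math

    def sphere_matches(database, query, radius):
        matches = []
        for record_id, point in enumerate(database):
            distance = math.sqrt(sum((q - p) ** 2 for q, p in zip(query, point)))
            if distance <= radius:
                matches.append(record_id)
        return matches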

We know from earlier in this series that the volume of an n-sphere is tiny compared to the cube that encloses it. That’s good. If we do get a match then we know that the match is probably in a very small volume and therefore likely to be correct. However, we also know that such a system is not tolerant of small errors in many dimensions — because distances grow so fast, a small deviation in many dimensions leads to a big distance. That means that we probably will have to construct a sphere with a fairly large radius in order to allow for measurement error in many dimensions.

But that’s not really so bad. What’s bad about this scheme is that if there are a million points in the database then we have to calculate one million Cartesian distances for every query. In a 100-dimensional space that means computing 100 differences, 100 multiplications, 100 additions and one comparison a million times just to do one query.

We could build some optimizations in — we could check at each addition whether the distance accumulated so far already exceeds the tolerance and immediately discard the point. Maybe we could cut it down to on average 50 differences, multiplications and additions per point — but then we’ve just added in 50 comparison operations. No matter how you slice it, we’re doing a lot of math.
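
Here is a sketch of that early-exit idea; note that if we compare against the squared tolerance we never need to take a square root at all:

    def within_radius(point, query, radius_squared):
        total = 0.0
        for p, q in zip(point, query):
            total += (q - p) ** 2
            if total > radius_squared:
                return False  # already too far; no need to finish the sum
        return True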

A hash table works by automatically discarding all but a tiny number of points, so that you just have to check a small number rather than the whole database. A Locality Sensitive Hash algorithm (which I might write about in another series) would work well here except that LSH algorithms have lousy performance if a large percentage of the queries are junk. Let’s assume that there are going to be a lot of junk queries. Is there some way that we can more rapidly find valid points?

We learned earlier that 99.9% of the volume of an n-sphere is contained by an n-cube with the same center as the sphere and an edge length fairly close to that of the radius.

Hmm.

Determining if a point is inside a cube is a lot less math than determining if a point is inside a sphere. With a cube you don’t need to compute any distance at all. You just compare the point’s position in each dimension against the cube’s upper and lower boundaries in that dimension and see whether it falls between them. For a 100-dimensional space, that’s 100 to 200 comparisons per point, and as soon as even one of the dimensions is bad, you can skip this cube and move on.
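
As a sketch, with half_edge standing in for the tolerance and the cube centered on each database point:

    def inside_cube(point, query, half_edge):
        # The cube around the database point has edges 2 * half_edge long.
        for p, q in zip(point, query):
            if q < p - half_edge or q > p + half_edge:
                return False  # one bad dimension is enough to reject this cube
        return True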

Maybe we can get it down to around a few dozen comparisons per point for a straightforward linear search. That’s pretty good compared to the distance metric.

We know that a cube is much, much more voluminous than a sphere whose radius is comparable to the cube’s edge length. We don’t really care about the 0.1% false negative rate caused by clipping off the bits of the sphere that are close to the axes. But what about the false positives, the points that fall in the huge volume inside the cube but outside the sphere? These are easily dealt with: once we have winnowed the likely matches down to a small subset through the cheap cube method, we can do our more expensive spherical check on the small number of remaining candidates to eliminate the false positives.
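
Putting the two stages together, reusing the inside_cube and within_radius sketches from above:

    def search(database, query, half_edge, radius_squared):
        # Cheap cube filter first, expensive sphere check only on the survivors.
        candidates = [i for i, p in enumerate(database) if inside_cube(p, query, half_edge)]
        return [i for i in candidates if within_radius(database[i], query, radius_squared)]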

By this point your bogosity sense should be tingling.

This armchair performance analysis is completely bogus. Yes, ~70 comparisons is cheaper than ~100 subtractions, multiplications and additions. Who cares? That’s not the most expensive thing. Performance analysis always has to be about what the most expensive thing is.

Think about this database for a moment — a million records, each record is a 100-dimensional point in some space. Let’s suppose for the sake of argument that it’s a space of 8-byte double-precision floats. That’s one million records times 100 dimensions times 8 bytes, or 800 megabytes of memory. Now, there are certainly database servers out there that can keep an 800 meg file in physical memory, but we’re clearly pushing the envelope here. What if the database is large enough that there isn’t enough physical memory to keep the whole thing in all at once? At that point, the cost of iterating over all one million records becomes the cost of swapping big chunks of the database to disk and back.

Now consider what happens if there are multiple threads doing searches at the same time. Unless you can synchronize them somehow so that they all use the same bits of the database at the same time, you’re just going to exacerbate the swapping problem as different threads access different parts of the record set. (And if you can synchronize them like that, why even have multiple threads in the first place?)

The fundamental problem with both the sphere and the cube methods isn’t that the math per record is expensive, it’s that they must consider every record in the database every time you do a query. Getting those records into memory is the expensive part, not the math.

What we really need is some index, small enough to be kept in memory, that quickly eliminates lots of the records from consideration in the first place. Once we’re down to just a few candidates, they can be pulled into main memory and checked using as computationally intensive an algorithm as we like.

There’s no obvious way to build an index of hyperspheres, but there might be things we can do with hypercubes. Stay tuned.

High-Dimensional Spaces Are Counterintuitive, Part Three

My next book project is ready for copyediting, the wedding invitations are sent out, my bug count is under control, I’ve got a layer of levelling compound poured in the basement, there’s no more ivy on my roof, and I’m rigging my sailboat sometime this week (about six weeks later than I would have liked, but let’s not be picky). Life is getting back to normal in my little corner of Seattle. So what better time to continue with my little exploration of high-dimensional spaces?

Last time we noticed a few interesting facts:

  • most of the volume of a high-dimensional object is very near its surface.
  • small movements away from a point in many dimensions add up to a long total distance.

Suppose you have a bunch of points in space scattered randomly around some average. You would expect that if the points are more or less evenly distributed then the number of points in a given volume of the space would be proportional to the volume. (In fact, that’s probably how we’d define “more or less evenly distributed.”)

Let’s think of a low-dimensional example – points on a line. For example, suppose you’ve got a thousand 30-year-old women and you measure their heights. If the average is 150 cm, then you’d expect that the number of women that fall into the 2-cm range from 145-147 is roughly the same as from 150-152. There will probably be more in the range closer to the mean, but you’d expect that the order of magnitude would be roughly the same. It’s not like there’s going to be 100 in one range and 10 in the other. I’m assuming here that we’re considering regions close to the mean so that the distribution is pretty even throughout.

Now suppose you take a 2-cm range and a 1-cm range, both close to the mean. You would similarly expect that the 1-cm range would have roughly half as many points in it as the 2-cm range, right? The number of points in a region is roughly proportional to the size of the region.

What if the distribution is 2-dimensional? Suppose now you plot both the heights and hair lengths of these 1000 women. Again, there will be some average around which the points cluster, say (150 cm, 30 cm). Now imagine that you draw two circles of the same area close to the mean. Again, you’d expect that two circles of the same area would have roughly the same number of points inside them. The circle with its center closer to the mean probably has a few more, but you’d expect roughly the same number of points in each circular region.

Now what if you make one of those circles half the radius of the other? Then it will have only one quarter of the area, and therefore, on average, only one quarter as many points.

Now suppose that you have a bunch of points scattered around a 100-dimensional space, all clustered around a particular average point. Pick a hyperspherical region centered on the average, with radius large enough to contain most of the points.

We expect to find more points in regions of the space with more volume. We know that 99% of the volume of this n-sphere is concentrated in a thin shell near its surface. The conclusion is inescapable: points clustered around a mean in high dimensions are far more likely to be found in the high-volume narrow shell far from the mean than in the low-volume region near the mean itself!
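
If you want to convince yourself with a quick back-of-the-envelope calculation: the volume of an n-ball scales as the nth power of its radius, so the fraction of the volume lying within, say, 5% of the surface is one minus 0.95 to the hundredth power.

    n = 100
    shell_fraction = 1 - 0.95 ** n
    print(shell_fraction)  # about 0.994: over 99% of the volume is within 5% of the surface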

This again seems really counterintuitive at first, but it’s not if you think about it for a while. In our example above, most of the women are of near-average height. And most of the women are of near-average hair length. But clearly the number of women that are near-average height AND near-average hair length is considerably smaller than either taken separately.

As we add more and more dimensions to the mix, the number of points where all measurements are near the mean point gets very small indeed. Of the six billion people in the world, how many of them are within 1% of average height and average hair length and average age and average income and average cholesterol level and average IQ? Probably a very, very small number. Most everyone is slightly abnormal somehow! And in a high-dimensional space, many small deviations from the mean add up to a large distance from the mean. If you plotted 100 characteristics of 6 billion people in a 100-dimensional space, the number of people very near the middle of every axis is quite small; there’s hardly any volume there.
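
To put a number on that last point: a deviation of just one unit along each of 100 axes already puts a point ten units away from the mean.

    import math

    # One unit off along each of 100 axes is a Euclidean distance of sqrt(100).
    deviations = [1.0] * 100
    print(math.sqrt(sum(d * d for d in deviations)))  # 10.0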

What the heck does this have to do with computer science? Suppose you have all that personal data in a database, you have a dead guy in the morgue with no ID, and you want to find a match in your database, if there is one. What if you have 100 variables about a song and you want to identify whether a given scrap of music is in your database? There are lots of real-world search problems that have sparse, large data sets spread out over a high-dimensional space.

Next time we’ll dig into that a little bit more.