
Odious ambiguous overloads, part one

As you might have gathered, a lot of the decisions we have to make day-to-day here involve potential breaking changes on odd edge cases. Here’s another one.

Consider the following terrible but legal code:

public interface I1<U> {
  void M(U i);
  void M(int i);
}

My intense pain begins when the user writes:

public class C1 : I1<int> 
{

In the early version of the C# specification it was actually illegal to have an ambiguity like this, but the spec was changed so that when doing overload resolution on a call to M we can choose one, according to section 14.4.2.2:

If one [argument type] is non-generic, but the other is generic, then the non-generic is better.

But still, as you will soon see, we’re in a world of hurt for another reason, namely that this class must now implement both M(int i) and, uh, M(int i). Fortunately, a non-explicit implementation binds to both methods, so this works just fine:

public class C1 : I1<int> 
{
  public void M(int i)
  {
    Console.WriteLine("class " + i);
  }
}

The method implements both versions of M and the contract is satisfied.  But we have problems if we try to do an explicit interface implementation:

public class C2 : I1<int> 
{
  void I1<int>.M(int i) 
  {
    Console.WriteLine("explicit " + i);
  }
}

Does this explicitly implement both members of I1?  Or just one?  If so, which one?

In the current compiler this code produces a terrible, terrible error:

error CS0535: 'C2' does not implement interface member 'I1<int>.M(int)'

Is that so?  It sure looks like it implements it!

What happens when we have both an explicit implementation and a class implementation?  The spec does not actually say what to do. It turns out that we end up in a situation where runtime behaviour depends on the source code order of the interface members! Check this out:

public interface I1<U> 
{
  void M(U i); // generic first
  void M(int i);
}
public interface I2<U> 
{
  void M(int i);
  void M(U i); // generic second
}
public class C3: I1<int>, I2<int> 
{
  void I1<int>.M(int i) 
  {
    Console.WriteLine("c3 explicit I1 " + i);
  }
  void I2<int>.M(int i) 
  {
    Console.WriteLine("c3 explicit I2 " + i);
  }
  public void M(int i) 
  {
    Console.WriteLine("c3 class " + i);
  }
}
class Test 
{
  static void Main() 
  {
    C3 c3 = new C3();
    I1<int> i1_c3 = c3;
    I2<int> i2_c3 = c3;
    i1_c3.M(101);
    i2_c3.M(102);
  }
}

What happens here is that the explicit interface implementation mappings in the class match the methods in the interfaces in a first-come-first-served manner:

  • void I1<int>.M(U) maps to explicit implementation void I1<int>.M(int i)
  • void I1<int>.M(int) maps to implicit implementation public void M(int i)
  • void I2<int>.M(int) maps to explicit implementation void I2<int>.M(int i)
  • void I2<int>.M(U) maps to implicit implementation public void M(int i)

Then (because of the aforementioned section 14.4.2.2) when we see

i1_c3.M(101);
i2_c3.M(102);

we prefer the typed-as-int versions to the generic substitution versions, so this program calls the two non-generic versions and produces the output:

c3 class 101
c3 explicit I2 102

And as you’d expect, if we force the compiler to pick the generic versions then we get similar behaviour:

static void Main() 
{
  C3 c3 = new C3();
  Thunk1<int>(c3, 103);
  Thunk2<int>(c3, 104);
}
static void Thunk1<U>(I1<U> i1, U u) 
{
  i1.M(u);
}
static void Thunk2<U>(I2<U> i2, U u) 
{
  i2.M(u);
}

The binding of the overload resolution in the thunk bodies happens before the substitution of the type parameters, so these always bind to the generic versions of the methods. As you would expect from the mappings above, this outputs

c3 explicit I1 103
c3 class 104

Again, this shows that source code order has an unfortunate semantic import.

Given this unfortunate situation — no spec guidance and an existing implementation that behaves strangely — what would you do? (Of course “do nothing” is an option.) I’m interested to hear your ideas, and I’ll describe what we actually did next time.


Notes from 2020:

This article produced a large number of interesting comments, which I summarize in the next episode.

Five-Dollar Words for Programmers, Part One: Idempotence

Programmers, particularly those with a mathematical background, often use words from mathematics when describing their systems. Unfortunately, they also often do so without consideration of the non-mathematical background of their colleagues. I thought I might talk today a bit about the word “idempotent”. This is a very easy concept but lots of people don’t know that there is a word for it.

There are two closely related mathematical definitions for “idempotent”. First, a value is “idempotent under function foo” if the result of doing foo to the value is the value right back again. Another common way to say this is that the value is a fixpoint of foo.

Second, a function is “idempotent” if the result of doing it twice — where we are feeding the output of the first call into the second call — is exactly the same as the result of doing it once. Or, in other words, every output of the function is idempotent under it. Or in even more other words, every output of an idempotent function is a fixpoint of the function.

The most trivial mathematical example of the second kind is the constant function. If f(x) = c, then clearly f(x) = f(f(x)) = f(f(f(x))) … so f is idempotent (and the constant is idempotent under it).

The identity function f(x) = x is also idempotent; I’m sure you see why.

The function that takes a finite set of numbers and returns a set containing only its largest element is idempotent (and every one-element set is idempotent under it). I’m sure you can think of lots of examples of idempotent functions.

To get a handle on the other sense, pick an operation — say, doubling. The only value which is idempotent under that operation is zero. The operation “subtracting any non-zero value” has no idempotent values. Squaring has two idempotent values, zero and one.

The way programmers use “idempotent” is slightly different, but this concept comes up all the time in practical programming, particularly around caching logic. Usually when used in the computer science sense we don’t mean that the effect of the function is invariant under composition of the function with itself, but rather that it is invariant over any number of calls greater than one. For example, I don’t know how many times I’ve written:

HRESULT GetTypeLibCreator(ICreateTypeLib2 ** ppctl) 
{
  if (this->m_pctl == NULL) 
  {
    HRESULT hr; 
    hr = CreateTypeLib2(SYS_WIN32, pszName, &this->m_pctl);
    if (FAILED(hr)) 
      return hr;
  }
  *ppctl = this->m_pctl;
  this->m_pctl->AddRef();
  return S_OK;
}

A nice little idempotent function — calling it two, three, or any other number of times with the same arguments has exactly the same result as calling it once.

The place you see the other characterization of idempotence all the time is in C++ header files. Include a needed header zero times and you’ll get “not defined” errors. Accidentally include it twice and you’ll get “redefinition” errors. It’s a major pain to make sure that every header file is included exactly once. Therefore, most headers use some trick to make them idempotent under the inclusion operation:

#ifndef STUFF_H_INCLUDED
#define STUFF_H_INCLUDED

// headers here

#endif // STUFF_H_INCLUDED

or in more modern systems, the “#pragma once” directive makes headers idempotent under inclusion.

Floating Point Arithmetic, Part One

A month ago I was discussing some of the issues in integer arithmetic, and I said that issues in floating point arithmetic were a good subject for another day. Over the weekend I got some questions from a reader about floating point arithmetic, so this seems like as good a time as any to address them.

Before I talk about some of the things that can go terribly wrong with floating point arithmetic, it’s helpful (and character building) to understand how exactly a floating point number is represented internally.

To distinguish between decimal and binary numbers, I’m going to do all binary numbers in fixed-width.

Here’s how floating point numbers work. The floats I’m talking about here are 64-bit double-precision values. Of those 64 bits, one bit represents the sign: 0 is positive, 1 is negative.

Eleven bits represent the exponent. To determine the exponent value, treat the exponent field as an eleven-bit unsigned integer, then subtract 1023. However, note that the exponent fields 00000000000 and 11111111111 have special meaning, which we’ll come to later.

The remaining 52 bits represent the mantissa.

To compute the value of a float, here’s what you do.

  • Write out all 52 bits of the mantissa.
  • Stick a 1. onto the left hand side.
  • Compute that value as a 53 bit fraction with 52 fractional places.
  • Multiply that by two to the power of the given exponent value.
  • Sign it appropriately.

So for example, the number -5.5 is represented like this: (sign, exponent, mantissa)

(1, 10000000001, 0110000000000000000000000000000000000000000000000000)

The sign is 1, so it is a negative number.

The exponent is 1025 – 1023 = 2.

Put a 1. on the top of the mantissa and you get

1.0110000000000000000000000000000000000000000000000000

which in decimal is 1.375 and sure enough, -1.375 x 2^2 = -5.5
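
If you want to check that decomposition for yourself, a few lines of C# will pull the three fields out of a double. This is just a sketch; it ignores the special exponent values described below:

long bits = BitConverter.DoubleToInt64Bits(-5.5);
long sign = (bits >> 63) & 1;                    // 1, so the number is negative
long exponent = (bits >> 52) & 0x7FF;            // 1025; subtract the 1023 bias to get 2
long mantissa = bits & 0xFFFFFFFFFFFFF;          // the 52 mantissa bits, 0110000...0
double fraction = 1.0 + (double)mantissa / (1L << 52);    // stick the 1. on the front: 1.375
double value = (sign == 1 ? -1 : 1) * fraction * Math.Pow(2, exponent - 1023);
Console.WriteLine(value);                        // -5.5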

This system is nice because it means that every number in the range of a float has a unique representation, and therefore doesn’t waste bits on duplicates.

However, you might be wondering how zero is represented, since every bit pattern has 1. plunked onto the left side. That’s where the special values for the exponent come in.

If the exponent is 00000000000, then the float is considered a “denormal”. It gets 0. plunked onto the left side, not 1., and the exponent is assumed to be -1022.

This has the nice property that if all bits in the float are zero, it is representing zero.

Note that this lets you represent smaller numbers than you would be able to otherwise, as we’ll see, though you pay the price of lower precision. Essentially, denormals exist so that the chip can do “graceful underflow” — represent tiny values without having to go straight to zero.

If the exponent is 11111111111 and the mantissa is all zeros, that’s Infinity.

If the exponent is 11111111111 and the mantissa is not all zeros, that’s considered to be Not A Number — this is a bit pattern reserved for errors.

So the biggest positive normalized float is

(0, 11111111110, 1111111111111111111111111111111111111111111111111111)

which is

1.1111111111111111111111111111111111111111111111111111 x 2^1023.

The smallest positive normalized float is

(0, 00000000001, 0000000000000000000000000000000000000000000000000000)

which is

1.000 x 2^-1022

The biggest positive denormalized float is:

(0, 00000000000, 1111111111111111111111111111111111111111111111111111)

which is

0.1111111111111111111111111111111111111111111111111111 x 2^-1022

The smallest positive denormalized float is

(0, 00000000000, 0000000000000000000000000000000000000000000000000001)

which is

0.0000000000000000000000000000000000000000000000000001 x 2^-1022 = 2^-1074
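
If you want to see those extremes for yourself, C# exposes them as constants on the double type; a quick sketch:

Console.WriteLine(double.MaxValue);     // about 1.8 x 10^308, the biggest normalized double
Console.WriteLine(double.Epsilon);      // about 4.94 x 10^-324, the smallest positive denormal, 2^-1074
Console.WriteLine(BitConverter.DoubleToInt64Bits(double.Epsilon));   // 1: only the lowest mantissa bit is set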

 


Next time on FAIC: floating point math is nothing like real number math.

250% of What, Exactly?

I just got a question this morning about how to take two collections of items and determine how many of those items had the same name. The user had written this straightforward but extremely slow VBScript algorithm:

For Each Frog In Frogs
  For Each Toad In Toads
    If Frog.Name = Toad.Name Then
      SameName = SameName + 1
      Exit For
    End If
  Next
Next

There were about 5000 frogs, 3000 toads and 1500 of them had the same name. Every one of the 3500 unsuccessful searches checked all 3000 toads, and the 1500 successful searches on average checked 1500 toads each. Each time through the inner loop we do one loop iteration, two calls to the Name property, and one string comparison. Add all those up and you get roughly 50 million function calls to determine this count.

This code has been somewhat simplified – the actual user code was doing more work inside the loop, including calls to WMI objects. The whole thing was taking 10+ minutes, which is actually pretty fast considering how much work was being done. Each individual function call was only taking a few microseconds, but fifty million calls adds up!

Now, we’ve been down this road before in this blog (here, here and here) and so of course I recommended building a faster lookup table rather than doing a full search through the collection every time.

Set FrogLookup = CreateObject("Scripting.Dictionary")
For Each Frog In Frogs
  Set FrogLookup(Frog.Name) = Frog
Next
For Each Toad In Toads
  If FrogLookup.Exists(Toad.Name) Then SameName = SameName + 1
Next

Which is much, much faster. That’s only about 16 thousand function calls. Now, this is maybe not an apples-to-apples comparison of function calls, but we at least have good reason to believe that this is going to be several times faster. 

And indeed it was. But I’m still not sure how much, which brings me to the actual subject of today’s blog. The user reported that the new algorithm was “250% better”. I hear results like this all the time, and I always have to ask to clarify what it means. You can’t just say “n% better” without somehow also communicating what you’re measuring.


(UPDATE: This reported result understandably confused some readers.  Clearly the new loop here is thousands of times faster, not a mere 250% faster. As I said before, the sketch above is highly simplified code. The real-world code included many thousands of additional calls to WMI objects which were not eliminated by this optimization. Eliminating these 50 million function calls helped — you should always eliminate the slowest thing first!  But doing so also exposed a new “hot spot” that needed further optimization.  However, the point of this article is not the benefits of using lookup tables, but rather that using unexplained percentages to report performance results is potentially misleading.  The result above is merely illustrative.  See the comments for more details.)


Allow me to illustrate. Suppose I have a web server that is serving up ten thousand pages every ten seconds. I make a performance improvement to the page generator so that it is now serving up fifteen thousand pages every ten seconds. I can sensibly say:

  • performance has improved by 50%, because we are now serving up 5000 more pages every ten seconds, and 5000 is 50% of the original 10000. In this world, any positive percentage is good.
  • performance is now 150% of original performance, because 15000 is 150% of 10000. In this world, 0%-100% is worse or the same, and anything over 100% is good.
  • we’ve gone from 1000 microseconds per page to 667 microseconds per page, saving 333 microseconds per page. 333 is 33% of 1000, so we’ve got a 33% performance improvement. In this world, 0% is bad, 100% is perfect, and more than 100% is nonsensical.
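
Here is the same improvement computed all three ways in a few lines of C#, just to make the ambiguity concrete:

double before = 10000.0 / 10.0;   // pages per second before the change
double after = 15000.0 / 10.0;    // pages per second after the change
Console.WriteLine((after - before) / before * 100);               // 50: "50% more pages per second"
Console.WriteLine(after / before * 100);                          // 150: "150% of the original throughput"
Console.WriteLine((1 / before - 1 / after) / (1 / before) * 100); // 33.3: "33% less time per page"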

If I can sensibly say that performance is better by 50%, 150% and 33%, all referring to exactly the same improvement, then I cannot actually be communicating any fact to my listeners! They need to know whether I’m measuring speed or time, and if speed, whether I’m comparing original speed to new speed or original speed to the difference.

So what is “250%” better? Clearly not raw timings. If this loop took 10 minutes to run before, 250% of 10 minutes is 25 minutes, but surely it is not running in -15 minutes now! I assume that the measurement is speed — say, number of loops runnable per hour. If before it took ten minutes and therefore we could run this loop six times per hour, and now it is 250% better, 250% of 6 is 15, and this is “better”, so we need to add. So that’s 21 loops per hour, or about 3 minutes per loop. Had the user accidentally said “250% faster” but meant to say “250% of previous performance” then we’d conclude that 250% of 6 per hour is 15 per hour, so now we’ve got 4 minutes per loop.

And of course, this is assuming that the person reporting the improvement is actually trying to communicate facts to the audience. Be wary! Since 33%, 50% and 150% are all sensible ways to report this improvement, which do you think someone who wants to impress you is going to choose? There are opportunities here for what Kee Dewdney calls “percentage pumping” and other shyster tricks for making weak numbers look more impressive. Professor Dewdney’s book on the subject, “200% of Nothing”, is quite entertaining.

The moral of the story is: please do not report performance improvements in percentages. Report performance improvements in terms of raw numbers with units attached, such as “microseconds per call, before and after change”, or “pages per second, before and after change”. That gives the reader enough information to actually understand the result.

(And the other moral is, of course, lookup tables are your friends.)

High-Dimensional Spaces Are Counterintuitive, Part Five

All of this stuff about high dimensional geometry has been known for quite a while. The novel bit that this paper is all about is how to actually build a fast index for searching a high-dimensional space where the query has a strong likelihood of being junk. The idea that these smart Microsoft researchers came up with is called Redundant Bit Vectors, and it goes like this:

Suppose we’ve got a 100 dimensional space of 8-byte floats, with one million records to search. Each record point has a “tolerance hypercube” constructed around it — if a query point falls into that cube, then the record is a likely candidate for a match. Before we do any searches we build an index – building the index can take as long as we like so long as searching it is fast and the index is small.

To build the index, divide every dimension up into non-overlapping “buckets”. The algorithm for choosing the right number of buckets and the best bucket sizes isn’t particularly interesting, so without loss of generality let’s just assume that you’ve got four equally sized buckets for each dimension and move on. We’ve got 100 dimensions, four buckets per dimension, that’s 400 buckets. Now associate each bucket with a one-million-bit integer. That’s a big integer – what integer do we pick?

Consider the third bucket for the 57th dimension. There’s a million-bit integer associated with that bucket. Set the nth bit of the integer to 1 if the nth database record’s tolerance hypercube overlaps any part of the third bucket, considering only the 57th dimension, 0 if it does not. That is, this enormous integer answers the question “if the 57th dimension of the query is in the third bucket, what are all the database entries that we can immediately eliminate?” The ones that we can eliminate have zeros in their position in the integer.

Once we build those 400 million-bit integers, doing a query is a snap. First, figure out which 100 buckets the query point is in, one bucket per dimension. That gives us 100 one-million-bit integers. Binary AND those 100 integers together, and the million-bit integer that results has bits on only for those points which have tolerance hypercubes that overlap the query bucket in all 100 dimensions.
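
In code, the query step is nothing more than a loop of ANDs. Here is a minimal C# sketch; the index layout (one bit vector per dimension/bucket pair) and the names are mine, not the paper’s:

using System.Collections;   // for BitArray

// index[d, b] has one bit per record: 1 if the record's tolerance hypercube
// overlaps bucket b in dimension d. queryBuckets[d] is the bucket that the
// query point falls into in dimension d.
static BitArray FindCandidates(BitArray[,] index, int[] queryBuckets)
{
    int recordCount = index[0, 0].Length;
    BitArray candidates = new BitArray(recordCount, true);    // start by assuming every record might match
    for (int d = 0; d < queryBuckets.Length; d++)
        candidates.And(index[d, queryBuckets[d]]);            // eliminate records that cannot match in this dimension
    return candidates;   // the bits still set are the few candidates worth checking exactly
}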

If the bucket sizes are chosen well then that should be a tiny number of points. Ideally, each of the four buckets per dimension would match only 25% of the points, but that’s not likely – maybe the tolerance rectangles are large enough and the dimensional data are interrelated enough that each bucket only eliminates 50% of the points. But even if that sub-optimal situation arises, each AND eliminates another 50%.

The first million-bit integer for the first dimension gets us down to 500000 candidate points, the second down to 250000, the third 125000, and by the time we get to the twentieth dimension we should be down to only a handful of points, if any. Detecting junk is very fast as the result will quickly go to all zeros. If it isn’t junk and you discover a handful of match candidates then you can pull out the old Cartesian distance metric to check whether the query point is really a good match for any of them.

Obviously there are no chips that can AND together one hundred integers of a million bits each – most chips do it in 32 or 64 bit chunks, two at a time. But by being clever about how the bits are stored you can ensure good cache locality, so that the chip is almost always ANDing together portions of the integers which are close together in memory. And by being clever about keeping track of when there are huge long runs of zeros in the result as it is being computed, you can make the ANDing process faster by skipping regions that you know are going to go to zero.

Is this index small enough to keep in main memory? 400 million-bit integers is 50 MB. That’s a big chunk of memory, but certainly more reasonable than keeping the entire 800 MB database in memory.

Note that this is still a linear-time algorithm – if we’ve got n records then we’ve got to work with an n-bit integer, and that will take O(n) time. But because AND is such a fast operation compared to loading records into memory and computing Cartesian distances, in practice this is a winning algorithm.

Well, that was entertaining. Next time we’ll get back down to earth — I’ll answer a few recent and not-so-recent questions about scripting.

High-Dimensional Spaces Are Counterintuitive, Part Four

It’s reasonably common to have a database containing a few thousand or million points in some high-dimensional space. If you have some “query point” in that space you might like to know whether there is a match in your database.

If you’re looking for an exact match then the problem is pretty easy — you just come up with a suitable hash algorithm that hashes points in your space and build a big old hash table. That’s extremely fast lookup. But what if you’re not looking for an exact match, you’re looking for a close match? Perhaps there is some error, either in the measurements that went into the database or the measurement of the query point, that needs to be accounted for.

Now the problem is “for a given query point, what is the closest point in the database to it?” That’s not a problem that’s amenable to hashing. (Well, you could use a Locality Sensitive Hash algorithm, but we’ll rule that out later.)

What if the closest point to the query point is so far away from the query point that it’s more likely that the query point simply has no match at all in the database? In that case we don’t really want the closest point, because that’s not really relevant.

Essentially you can think of every point in the database as being surrounded by a “probability sphere”. As you move farther away from a given point in the database, the probability that you have a match to that point gets smaller and smaller. Eventually it’s small enough that the probability that the query point is “junk” — not a match to any point in the database at all — gets larger than the probability that the closest point is a match.

To sum up the story so far: we’ve got a query point, which may or may not correspond to a point in our database of points. We need a way to say “is this point junk? If not, what are some reasonably close points in the database that might be matches?”

Here’s an idea for an algorithm: compute the distance between the query point and every point in the database. Discard all points where the distance is outside of a certain tolerance. Geometrically, we’re constructing a sphere of a certain radius around each point in the database. We check to see whether the query point is inside each sphere.

We know that the volume of an n-sphere is tiny. That’s good. If we do get a match then we know that the match is probably in a very small volume and therefore likely to be correct. However, we also know that such a system is not tolerant of small errors in many dimensions — because distances grow so fast, a small deviation in many dimensions leads to a big distance. That means that we probably will have to construct a sphere with a fairly large radius in order to allow for measurement error in many dimensions.

But that’s not really so bad. What’s bad about this scheme is that if there are a million points in the database then we have to calculate one million Cartesian distances for every query. In a 100-dimensional space that means computing 100 differences, 100 multiplications, 100 additions and one comparison a million times just to do one query.

We could build some optimizations in — we could check at each addition whether the total distance computed so far exceeds the tolerance and automatically discard the point. Maybe we could cut it down to on average 50 differences, multiplications and additions per point — but then we’ve just added in 50 comparison operations. No matter how you slice it, we’re doing a lot of math.
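
That optimization looks something like this sketch: accumulate the squared distance term by term and bail out as soon as it exceeds the squared tolerance.

static bool InsideSphere(double[] query, double[] record, double tolerance)
{
    double toleranceSquared = tolerance * tolerance;
    double distanceSquared = 0;
    for (int d = 0; d < query.Length; d++)
    {
        double difference = query[d] - record[d];
        distanceSquared += difference * difference;
        if (distanceSquared > toleranceSquared)
            return false;   // already too far away; discard this record
    }
    return true;
}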

A hash table works by automatically discarding all but a tiny number of points, so that you just have to check a small number rather than the whole database. A Locality Sensitive Hash algorithm (which I might write about in another series) would work well here except that LSH algorithms have lousy performance if a large percentage of the queries are junk. Let’s assume that there are going to be a lot of junk queries. Is there some way that we can more rapidly find valid points?

We learned earlier that 99.9% of the volume of an n-sphere is contained by an n-cube with the same center as the sphere and an edge length fairly close to that of the radius.

Hmm.

Determining if a point is inside a cube is a lot less math than determining if a point is inside a sphere. With a cube you don’t need to compute any distance. You just compare the position of the point in each dimension against the cube’s upper and lower boundaries in that dimension and see whether it falls between them. For a 100-dimensional space, that’s 100 to 200 comparisons per point, and as soon as even one of the dimensions is bad, you can skip this cube and move on.

Maybe we can get it down to around a few dozen comparisons per point for a straightforward linear search. That’s pretty good compared to the distance metric.
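
The cube test is just a per-dimension bounds check with the same early bail-out; here is a sketch:

static bool InsideCube(double[] query, double[] record, double tolerance)
{
    for (int d = 0; d < query.Length; d++)
        if (query[d] < record[d] - tolerance || query[d] > record[d] + tolerance)
            return false;   // one bad dimension is enough to reject this record
    return true;            // the query point is inside this record's tolerance hypercube
}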

We know that a cube of a given side is much, much more voluminous than a sphere of similar radius. We don’t really care about the 0.1% false negative rate caused by clipping off the bits of the sphere that are close to the axes. But what about the false positive rate of the huge volume of points that are inside the cube but not inside the sphere? These are easily dealt with: once we have winnowed the likely matches down to a small subset through the cheap cube method, we can do our more expensive spherical check on the small number of remaining candidates to eliminate false positives.

By this point your bogosity sense should be tingling.

This armchair performance analysis is completely bogus. Yes, ~70 comparisons is cheaper than ~100 subtractions, multiplications and additions. Who cares? That’s not the most expensive thing. Performance analysis always has to be about what the most expensive thing is.

Think about this database for a moment — a million records, each record is a 100-dimensional point in some space. Let’s suppose for the sake of argument that it’s a space of 8-byte double-precision floats. That’s 800 megabytes of memory. Now, there are certainly database servers out there that can keep an 800 meg file in physical memory, but we’re clearly pushing an envelope here. What if the database is large enough that there isn’t enough physical memory to keep the whole thing in all at once? At that point, the cost of iterating over all one million records becomes the cost of swapping big chunks of the database to disk and back.

Now consider what happens if there are multiple threads doing searches at the same time. Unless you can synchronize them somehow so that they all use the same bits of the database at the same time, you’re just going to exacerbate the swapping problem as different threads access different parts of the record set. (And if you can synchronize them like that, why even have multiple threads in the first place?)

The fundamental problem with both the sphere and the cube methods isn’t that the math per record is expensive, it’s that they must consider every record in the database every time you do a query. Getting those records into memory is the expensive part, not the math.

What we really need is some index that is small enough to be kept in memory that quickly eliminates lots of the records from consideration in the first place. Once we’re down to just a few candidates, they can be pulled into main memory and checked using as computationally intensive an algorithm as we like.

There’s no obvious way to build an index of hyperspheres, but there might be things we can do with hypercubes. Stay tuned.

High-Dimensional Spaces Are Counterintuitive, Part Three

My next book project is ready for copyediting, the wedding invitations are sent out, my bug count is under control, I’ve got a layer of levelling compound poured in the basement, there’s no more ivy on my roof, and I’m rigging my sailboat sometime this week (about six weeks later than I would have liked to, but let’s not be picky.) Life is getting back to normal in my little corner of Seattle. So what better time to continue with my little exploration of high-dimensional spaces?

Last time we noticed a few interesting facts:

  • most of the volume of a high-dimensional object is very near its surface.
  • small movements away from a point in many dimensions add up to a long total distance.

Suppose you have a bunch of points in space scattered randomly around some average. You would expect that if the points are more or less evenly distributed then the number of points in a given volume of the space would be proportional to the volume. (In fact, that’s probably how we’d define “more or less evenly distributed.”)

Let’s think of a low-dimensional example – points on a line. For example, suppose you’ve got a thousand 30-year-old women and you measure their heights. If the average is 150 cm, then you’d expect that the number of women that fall into the 2-cm range from 145-147 is roughly the same as from 150-152. There will probably be more in the range closer to the mean, but you’d expect that the order of magnitude would be roughly the same. It’s not like there’s going to be 100 in one range and 10 in the other. I’m assuming here that we’re considering regions close to the mean so that the distribution is pretty even throughout.

Now suppose you take a 2-cm range and a 1-cm range, both close to the mean. You would similarly expect that the 1-cm range would have roughly half as many points in it as the 2-cm range, right? The number of points in a region is roughly proportional to the size of the region.

What if the distribution is 2-dimensional? Suppose now you plot both the heights and hair lengths of these 1000 women. Again, there will be some average around which the points cluster, say (150 cm, 30 cm). Now imagine that you draw two circles of the same area close to the mean. Again, you’d expect that two circles of the same area would have roughly the same number of points inside them. The circle with its center closer to the mean probably has a few more, but you’d expect roughly the same number of points in each circular region.

Now what if you make one of those circles half the radius of the other? Then it will have only one quarter of the area, and therefore only on average one quarter as many points.

Now suppose that you have a bunch of points scattered around a 100-dimensional space, all clustered around a particular average point. Pick a hyperspherical region centered on the average, with radius large enough to contain most of the points.

We expect to find more points in regions of the space with more volume. We know that 99% of the volume of this n-sphere is concentrated in a thin shell near its surface. The conclusion is inescapable: points clustered around a mean in high dimensions are far more likely to be found in the high-volume narrow shell far from the mean, not in the low-volume region near the mean itself!

This again seems really counterintuitive at first but it’s not if you think about it for a while. In our example above, most of the women are of near average height. And most of the women are of near average hair length. But clearly the number of women that are near-average height AND near-average hair length is considerably smaller than either taken separately.

As we add more and more dimensions to the mix, the number of points where all measurements are near the mean point gets very small indeed. Of the six billion people in the world, how many of them are within 1% of average height and average hair length and average age and average income and average cholesterol level and average IQ? Probably a very, very small number. Most everyone is slightly abnormal somehow! And in a high-dimensional space, many small deviations from the mean add up to a large distance from the mean. If you plotted 100 characteristics of 6 billion people in a 100-dimensional space, the number of people very near the middle of every axis is quite small; there’s hardly any volume there.
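
If you don’t believe it, a rough Monte Carlo sketch in C# makes the point; give each of 100 coordinates a modest random deviation from the mean and see how far from the mean the points land:

Random random = new Random();
int dimensions = 100, samples = 10000;
double total = 0, closest = double.MaxValue;
for (int i = 0; i < samples; i++)
{
    double distanceSquared = 0;
    for (int d = 0; d < dimensions; d++)
    {
        double deviation = random.NextDouble() * 2 - 1;   // a small deviation in one dimension
        distanceSquared += deviation * deviation;
    }
    double distance = Math.Sqrt(distanceSquared);
    total += distance;
    closest = Math.Min(closest, distance);
}
Console.WriteLine(total / samples);   // about 5.8: the typical point is far from the mean
Console.WriteLine(closest);           // even the closest of ten thousand points is nowhere near the mean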

What the heck does this have to do with computer science? Suppose you have all that personal data in a database, you have a dead guy in the morgue with no ID, and you want to find a match in your database, if there is one. What if you have 100 variables about a song and you want to identify whether a given scrap of music is in your database? There are lots of real-world search problems that have sparse, large data sets spread out over a high-dimensional space.

Next time we’ll dig into that a little bit more.

High-Dimensional Spaces Are Counterintuitive, Part Two

The volume of an n-cube of edge length s is easy to work out. A 2-cube has s^2 units of area. A 3-cube has s^3 units of volume. A 4-cube has s^4 units of 4-volume, and so on — an n-cube has s^n units of n-volume. If the n-cube has an edge of s>1, say s=2, then clearly the n-volume dramatically increases as the dimensionality increases — each dimension adds a lot more “room” to the n-cube.

A 2-sphere (circle) is pretty close in area to the smallest 2-cube (square) that encloses it — sure, you lose some area at the four corners, but not a whole lot. Though the circle is far from the square at the four corners, it is very close to the square at the four sides. A circle has about 78% of the area of its enclosing square. But a 3-sphere inside the smallest 3-cube that encloses it is far from eight corners and close to only six sides. A 3-sphere is about half the volume of the 3-cube. As you go up in dimensions, you get more and more corners that are far from the n-sphere — there are 2^n corners and only 2n sides so the comparative volume of the sphere goes down.

In fact, you don’t even need to compare n-spheres to n-cubes — after you reach 5 dimensions, the n-volume of an n-sphere starts going down, not up, as dimensionality increases. With some pretty easy calculus you can show that the n-volume of an n-sphere of radius r is:

V(1) = 2r
V(2) = π r^2
V(n) = V(n-2) x 2 π r^2 / n

For any fixed radius this rapidly approaches zero as n gets big. Pick a big n, say 100. The volume of a 100-sphere is going to be (π r^2/50) x (π r^2/49) x (π r^2/48) … (π r^2/1). Suppose r is 1 — then all of those terms except for the last few are going to be quite a bit smaller than 1. Every time you add more dimensions, the n-volume of the unit n-sphere gets smaller and smaller even as the n-volume of the smallest n-cube that encloses the unit n-sphere gets exponentially larger and larger!
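
You can watch the volume collapse by just iterating that recurrence; a quick sketch for a unit radius:

double r = 1.0;
double[] volume = new double[101];
volume[1] = 2 * r;                  // V(1) = 2r
volume[2] = Math.PI * r * r;        // V(2) = pi r^2
for (int n = 3; n <= 100; n++)
    volume[n] = volume[n - 2] * 2 * Math.PI * r * r / n;
Console.WriteLine(volume[5]);       // about 5.26, the peak for a unit radius
Console.WriteLine(volume[100]);     // about 2.4 x 10^-40, essentially nothing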

Here’s another weird fact about the volume of hypersolids. Consider two squares, one inside the other. How big does the small square have to be in order to have, say, 1% the area of the larger square? That’s pretty easy. If the inner square has 10% the edge length of the outer square, then it has 1% of the area of the outer square.

What about nested 3-cubes? An inner 3-cube with edges 10% the length of the edge of the outer 3-cube would have 0.1% the volume, too small. Rather, it needs to have an edge about 21% of the edge of the outer 3-cube, because .21 x .21 x .21 = about 0.01.

What about a nested n-cube? In order to have 1% the n-volume of the outer n-cube, the inner n-cube needs to have an edge of (0.01)^(1/n) of the outer n-cube side. For n=100, that’s 0.955. Think about that for a moment. You’ve got two 100-cubes, one has edges 2 units long, the other has edges 1.91 units long. The larger n-cube contains ONE HUNDRED TIMES more volume.
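
Two lines of C# confirm the arithmetic:

Console.WriteLine(Math.Pow(0.01, 1.0 / 100));   // 0.95499..., the edge ratio that gives 1% of the volume
Console.WriteLine(Math.Pow(1.91 / 2.0, 100));   // about 0.01, so the inner 100-cube really does have about 1% of the volume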

Try to visualize the smaller n-cube being entirely inside the larger n-cube, the two n-cubes having the same central point. Now wrap your mind around the fact that the smaller n-cube is 1% the volume of the larger. The conclusion is unavoidable: in high dimensions the vast majority of the volume of a solid is concentrated in a thin shell near its surface! Remember, there’s 2^100 corners in a 100-cube, and that makes for a lot of space to put stuff.

It’s counterintuitive because the very idea of “near” is counterintuitive in higher dimensions. Every time you add another dimension, there’s more room for points to be farther apart.

The distance between opposite corners of a square of edge 2 is 2√2. The distance between opposite corners of a 3-cube is 2√3, quite a bit bigger. The distance between opposite corners of a 100-cube is 2√100 = 20 units! There are a whole lot of dimensions to move through, and that adds distance.

We could make the same argument for an n-sphere and show that the vast majority of its (comparatively tiny) volume is also in a thin shell near the surface; I’m sure you can see how the argument would go, so I won’t bother repeating myself.

Because distance is so much more “expensive” in higher dimensions, this helps explain why n-spheres have so much less volume than n-cubes. Consider a 100-cube of edge 2 centered on the origin enclosing a 100-sphere of diameter 2, also centered on the origin. The point (1,0,0,0,0,0…,0) is on both the 100-cube and the sphere, and is 1 unit from the origin. The point (1,1,1,…,1) is on the 100-cube and is ten units away from the origin. But a 100-sphere by definition is the set of points equidistant from the origin, and distance is expensive in high dimensions. The nearest point on the 100-sphere to that corner is (0.1, 0.1, 0.1, …, 0.1), 9 units away from the corner of the 100-cube. Now it’s clear just how tiny the 100-sphere is compared to the 100-cube.

OK, so far we’ve been considering n-cubes that entirely enclose n-spheres, i.e., an n-cube of edge length 2 that encloses a unit n-sphere, kissing the sphere at 2n points. But we know that this n-cube has ginormously more volume than the n-sphere it encloses and that most of that volume is near the edges and corners. What if we abandon the constraint that the n-cube contains 100% of the n-sphere’s volume? After all, there are only 200 points where the 100-sphere kisses the 100-cube, and that’s not very many at all.

Suppose we want a 100-cube that contains 99.9% of the volume of the unit 100-sphere. We can cover virtually all of the volume of the 100-sphere with a 100-cube of edge 0.7 instead of 2. Sure, we’re missing (1,0,0,0,…,0), but we’re still hitting (0.1,0.1,0.1,…) with huge amounts of room to spare. Most of the volume inside the 100-sphere isn’t near the 200 points with coordinates near the axes.

How much do we reduce the volume of the 100-cube by shrinking it from 2 on the edge to 0.7? We go from 2^100 units of n-volume to 0.7^100, a factor of around 4×10^45 times smaller volume! And yet we still enclose virtually all the volume of the 100-sphere. The corner of the smaller 100-cube at (0.35, 0.35, 0.35, …) is now only 2.5 units away from (0.1, 0.1, …) instead of 9 units away. This is a much better approximation of the unit 100-sphere. It’s still hugely enormous compared to the unit 100-sphere in terms of sheer volume, but look at how much volume we save by approximating the 100-sphere as a small 100-cube!

Feeling dizzy yet? Next time we’ll see that these facts

  • n-spheres are tiny compared to n-cubes
  • hypersolids have most of their volume close to their surfaces
  • you can enclose almost all the volume of an n-sphere with a small n-cube

have major repercussions when we throw probability theory into the mix.

High-Dimensional Spaces Are Counterintuitive, Part One

A friend of mine over in Microsoft Research pointed out to me the other day that high-dimensional spaces are really counterintuitive. He’d just attended a lecture by the research guys who wrote this excellent paper and we were geeking out at a party about it. I found this paper quite eye-opening and I thought I might talk a bit about some of the stuff that’s in here at a slightly less highfalutin level — the paper assumes a pretty deep understanding of high-dimensional spaces from the get-go.

**************************

It’s hard to have a geometrical picture in your head of a more than three-dimensional space. I usually have to use one of two analogies. The first analogy I like goes like this: think of a line — one dimensional. Imagine that you have a slider control that determines your position on that line, from, say, -1 to 1, left-to-right. That’s pretty visually clear.

Add another slider that determines your up-and-down position, and you’ve got a square. Each point on the square has a unique set of slider positions.

Add another slider that determines your out-and-in position, and you’ve got a cube. Again, these are easy to picture. Every point in the cube has a unique combination of slider positions that gets you to that point.

Now think of a cube with a slider control below it that lets you slide from intense red on one end through dark red and to black on the other. Now you’ve got four axes you can move around — height, width, depth and redness. The top right front corner of the bright red cube is a certain “colour distance” from the corner of the top right front black cube. That this is not a spatial dimension isn’t particularly relevant; we’re just picturing a dimension as redness for convenience.

Every time you want to add another dimension, add another slider — just make sure that whatever is sliding is completely independent of every other dimension. Once you’ve added green and blue sliders then you’ve got a six-dimensional hypercube. The “distance” between any two 6-d points is a function of how much you have to move how many sliders to get from one to the other.

That analogy gets across one of the key ideas of multi-dimensional spaces: that each dimension is simply another independent degree of freedom through which you can move. But this is a quite mathematical and not very geometric way of thinking about dimensionality, and I want to think about the geometry of these objects. Let’s abandon this analogy.

The second analogy is a little bit more geometrical. Think of a line, say two units long. Now associate every point on that line with another line, also two units long “crossing” it at the new line’s center. Clearly that’s a filled-in square — after all, every point along one side of a square has a straight line coming out from it perpendicularly. In our slider analogy, one slider determines the point along the “main line”, and the second determines how far to go along its associated line.

Think of another line, but this time, associate every point on it with a square. That’s a solid cube.

Now think of yet another line. Associate every point on it with a cube, and you’ve got a 4-cube. At this point it gets hard to visualize, but just as a cube is an infinite number of equally-sized squares stuck together along a line, so is a 4-cube an infinite number of 3-cubes stuck together along a line. Similarly, a 5-cube is a line of 4-cubes, and so on.

Where things get weird is when you start to think about hyperspheres instead of hypercubes. Hyperspheres have some surprising properties that do not match our intuition, given that we only have experience with two and three dimensional spheres. (2-spheres are of course normally called “circles”.)

The definition of a hypersphere is pretty simple — like a 2-sphere or 3-sphere, a hypersphere is the collection of points that are all the same distance from a given center point. (But distance works strangely in higher dimensions, as we’ll see in future episodes!)

It’s hard to picture a hypercube; it’s even harder to picture a hypersphere. The equivalent analogy for n-spheres requires us to think about size. Again, imagine a line two units long. Associate with each point on the line another line crossing at the middle. But this time, the associated lines are of different lengths. The lines associated with the end points are tiny, and the lines associated with the middle are longer. This describes a circular disk — for each point along the diameter of a circle you can draw a perpendicular line through the point extending to the boundaries of the disk on each side.

Now do the same thing again. Take a line, and associate each point on the line with a circle. If the circles are all the same size, you have a cylinder. But if they vary from small at the ends to big in the middle, you’ve got a sphere. Successive cross-sections of a sphere are all circles, but they start small and get big and then get small again.

Now do the same thing again. Take a line and associate each point on the line with a sphere, small at the ends and big in the middle, and you’ve got a 4-sphere. Successive “cross sections” of a 4-sphere are 3-spheres of varying size. Keep going to 5-, 6-, etc, spheres.

A circle of diameter 2 fits into a square of edge length 2, and a sphere of diameter 2 fits into a cube of edge length 2. Clearly an n-sphere of diameter two fits exactly into an n-cube of edge length two — the n-sphere “kisses” the center of each face of the n-cube. You can’t make the n-cube smaller without the n-sphere poking out of it somewhere.

But things start getting weird when you consider the volume of an n-sphere. Tomorrow we’ll compare the volume of an n-sphere to the volume of an n-cube, and discover some surprising and counterintuitive things about where that volume is.

You Want Salt With That? Part Four: Challenge-Response

My friend Kristen asked me over the weekend when I was going to stop blogging about crypto math and say something funny again. Everyone’s a critic!

Patience, my dear. Today, the final entry in my series on salt. Tomorrow, who knows?


So far we’ve got a system whereby the server does not actually know what your password is, it just has a salted hash of your password. But we’re still sending the password over the wire in clear text, which seems risky.

System #4

What if the hash goes over the wire? The client sends the user name to the server. The server sends the password salt to the client. The client appends the password to the password salt, hashes the result, and sends the hash to the server. The server compares the hash from the client to the hash in the user list. Now the password never goes over the wire at all. Awesome!

Unfortunately, this system is worse. In previous systems the eavesdropper got the password; in this system the eavesdropper gets the salted hash. The eavesdropper can then write their own client which sends that username and salted hash to the server.

And the “steal the password list” attack just came back; now an attacker who gets the password list gets all the salted hashes, and can use them to impersonate the user. Sure, the attacker will still find it hard to deduce the original password from the salted hash, but he doesn’t need to deduce the password anymore. (Unless of course the attacker is attempting to deduce your password on one system so that he can use it to attack another system that you use. This is why it’s a bad idea to use the same password for two different systems.)

Essentially we’ve turned the salted hash into a “password equivalent”. Can we fix this up?

System #5

How about this?

The client sends the username to the server. The server creates a second random salt which is NOT stored in the user list. This random salt is used only once — we either make it so big that the odds of generating it again are low, or keep a list of previously used random salts and pick a new one if we have a collision. We’ll call the random salt “the challenge” for reasons which will become apparent in a minute.

The server sends the user’s password salt and the challenge to the client. The client appends the password salt to the password and hashes the salted password. It converts the salted hash to a string, appends the string to the challenge, and hashes the resulting string to form the “response” hash. The response is sent across the wire.

The server then does the same thing – converts the stored salted password hash to a string, appends it to the challenge, and hashes the resulting string. If the response from the client is equal to the value the server just computed, then the client must have computed the same salted hash, and therefore knows the password.
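
Here is roughly what that exchange looks like in code. This is only a sketch; I’m using SHA-256 and hex strings as stand-ins for whatever hash and encoding a real system would choose, and a real implementation would use a constant-time comparison:

static string Hash(string text)
{
    using (var sha = System.Security.Cryptography.SHA256.Create())
    {
        byte[] digest = sha.ComputeHash(System.Text.Encoding.UTF8.GetBytes(text));
        return BitConverter.ToString(digest).Replace("-", "");
    }
}

// Client side: prove knowledge of the password without ever sending it.
static string ComputeResponse(string password, string salt, string challenge)
{
    string saltedHash = Hash(password + salt);   // recompute the salted hash locally
    return Hash(challenge + saltedHash);         // fold in the one-time challenge
}

// Server side: it stores only the salted hash, never the password itself.
static bool Verify(string storedSaltedHash, string challenge, string response)
{
    return response == Hash(challenge + storedSaltedHash);
}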

Now what does the eavesdropper know? The eavesdropper knows the username, the password salt, the challenge and the response. The eavesdropper has enough information to launch an offline dictionary attack against that user. But since the random challenge is never going to be used again, the fact that the attacker knows a valid challenge/response pair is essentially irrelevant.

This system has the downside that an attacker who gets the password file has obtained password equivalents, so no dictionary attack is necessary. (Unless of course the attacker is trying to determine a user’s password in order to try it against the user’s account on a different system!)

Fortunately, these weaknesses can be mitigated somewhat by changing your password frequently, not using the same passwords for different accounts, never using common dictionary words as passwords, and making passwords long — passphrases are better than passwords.


This general challenge-response pattern is quite common in authentication systems that rely upon shared secrets, because at no point is the original secret password ever actually stored or transmitted! With such a system, a machine that does not know your secret can verify that you do know it — almost seems magical, doesn’t it? Of course, now that you know how it works, it’s not quite so magical — the salted hash is essentially the shared secret, and it is stored and transmitted.

Clearly we could go on, adding more and more layers of encryption to this system to further mitigate these vulnerabilities, but I’m going to stop here. Readers interested in ways to solve problems such as mutual authentication (where the server authenticates the client AND the client verifies that it is talking to the right server) or other heavy-duty authentication tasks should read this charming dialogue on the design of the Kerberos authentication system: https://web.mit.edu/kerberos/www/dialogue.html

The foregoing is of course just a sketch with lots of details glossed over and greatly simplified for didactic purposes. Real professional-grade authentication systems do not work exactly like any of my sketches, though several are quite similar.

Unix boxes used to typically use a 12 bit salt and hash it with a 56 bit password using a DES-based hash, for instance. That’s pretty weak! A 12 bit salt only makes construction of a dictionary take 4096 times longer — one dictionary for each possible salt. In modern systems the more secure MD5 hash is often used now, which supports arbitrarily large salts. Unix boxes also used to keep the user list in the clear, so that anyone could read the salted hashes of the passwords; nowadays the encrypted passwords are kept in a file that can only be read by the operating system itself — defense in depth!

NT by contrast uses an unsalted 128 bit MD5 hash to store the password equivalent, so it is susceptible to dictionary attacks should the password file be stolen. NTLMv2 is an authentication system based on some of the challenge-response ideas from System #5. It hashes the password, user name, domain, and current time in a fairly complex way; I’ll spare you the details. And of course, many NT and unix boxes use Kerberos-based authentication these days.