Bad comparisons, part four

Posted on January 31, 2011 by ericlippert

One more easy one. I want to “sort” a list into a random, shuffled order. I can do that by simply randomizing whether any two elements are greater than, less than, or equal to each other:

myList.Sort((x, y) => (new Random()).Next(-1, 2));

That generates a random -1, 0 or 1 for every comparison, right? So it will sort the list into random order, right?

There are multiple defects here. First off, clearly this violates all our rules for comparison functions. It does not produce a total ordering, and in fact it can tell you that two particular elements are equal and then later tell you that they have become unequal. The sort algorithm is allowed to go into infinite loops or crash horribly when given such an ill-behaved comparison function. And in fact, some implementations of sorting algorithms attempt to detect this error and will throw an exception if they determine that the comparison is inconsistent over time.

Second, every time this thing is called it creates a new Random instance seeded to the current time, and therefore it produces the same result over and over again if called multiple times in the same timeslice; hardly random.

Shuffling is not sorting; it is the opposite of sorting, so don’t use a sort algorithm to shuffle. There are lots of efficient shuffle algorithms that are easy to implement. (That said, it is legal to shuffle by sorting a list ordered by a randomly chosen key. But the key must be chosen exactly once for each item in the list and there must be a correct comparison function on that key.)

Bad comparisons, part three

Posted on January 31, 2011 by ericlippert

Did you notice how last time my length comparison on strings was unnecessarily verbose? I could have written it like this:

static int ByLength(string x, string y)
{ 
  if (x == null && y == null) return 0;
  if (x == null) return -1; 
  if (y == null) return 1; 
  return CompareInts(x.Length, y.Length); 
}

static int CompareInts(int x, int y) { 
  // positive if x is larger
  // negative if y is larger
  //  zero if equal   
  return x - y; 
}

static Comparison<T> ThenBy<T>(this Comparison<T> firstBy, Comparison<T> thenBy) 
{ 
  return (x,y)=> 
  { 
    int result = firstBy(x, y); 
    return result != 0 ? result : thenBy(x, y); 
  }
}

Much nicer! My string length comparison method is greatly simplified, I can compose comparisons easily with the ThenBy extension method, and I can reuse CompareInts as a helper function in other comparisons I might want to write.

What’s the defect now?

This one should be easy after last time. The lengths of strings are always positive integers and in practice they never go above a few hundred million; the CLR does not allow you to allocate truly enormous multi-gigabyte strings. But though CompareInts is safe for inputs which are string lengths, it is not safe in general. In particular, for the inputs Int32.MinValue and Int32.MaxValue, the difference is 1. Clearly the smallest possible integer is smaller than the largest possible integer, but this method gives the opposite result! CompareInts should read:

static int CompareInts(int x, int y) 
{ 
  if (x > y) return 1; 
  if (x < y) return -1;
  return 0; 
}

The moral of the story here is that a comparison function that doesn’t compare something is probably wrong. Subtraction is not comparison.

Next time on FAIC: One more bad comparison.

Bad comparisons, part two

Posted on January 24, 2011 by ericlippert

Suppose I want to sort a bunch of strings into order first by length, and then, once they are sorted by length, sort each group that is the same length by some other comparison. We can easily build such a device with higher-order programming:

static Comparison<string> FirstByLength(Comparison<string> thenBy) 
{ 
  return (string x, string y) => 
  { 
    // Null strings are sorted before zero-length strings; remember, we need to provide a total ordering. 
    if (x == null && y == null) 
      return 0; 
    if (x == null) 
      return -1; 
    if (y == null) 
      return 1; 
    if (x.Length > y.Length) 
      return 1; 
    if (x.Length < y.Length) 
      return -1; 
    // They are the same length; sort on some other criterion.
     return thenBy(x, y); 
  }; 
}

Super. This idea of composing new comparison functions out of old ones is pretty neat. We can even built a reversal device:

static Comparison<string> Reverse(Comparison<string> comparison) 
{ 
  return (string x, string y) => -comparison(x, y); 
}

Something is subtly wrong in at least one of these comparison functions. Where’s the defect?

Let’s restate that contract again. The comparison function returns a negative integer if the first argument is smaller than the second, a positive integer if the first is greater than the second, and zero if they are equal. Any negative integer will do, and in particular, Int32.MinValue is a negative integer. Suppose we have a bizarre comparison function that returns Int32.MinValue instead of -1:

Comparison<string> bizarre = whatever;

and we compose it:

Comparison<string> reverseFirstByLength = Reverse(FirstByLength(bizarre));

Suppose two strings are equal in length and bizarre returns Int32.MinValue for those strings. Reverse should return a positive number, but -Int32.MinValue either throws an exception (in a checked context) or returns Int32.MinValue right back (in an unchecked context). Remember, there are more negative numbers that fit into an integer than positive numbers, by one.

The right implementation of Reverse is either to spell it out:

static Comparison<string> Reverse(Comparison<string> comparison)
{ 
  return (string x, string y) => 
  { 
    int result = comparison(x, y); 
    if (result > 0) return -1; 
    if (result < 0) return 1; 
    return 0; 
  }; 
}

Or to simply swap left and right:

static Comparison<string> Reverse(Comparison<string> comparison) 
{   
  return (string x, string y) => comparison(y, x); 
}

Next time on FAIC: Another way that comparisons go wrong.

Bad comparisons, part one

Posted on January 20, 2011 by ericlippert

The mutable List<T> class provides an in-place sort method which can take a comparison delegate. It’s quite handy to be able to sort a list into order by being able to compare any two elements, but you have to make sure you get it right.

First off, what are the requirements of the comparison delegate? They are clearly documented: the comparison takes two elements and returns a 32 bit signed integer. If the first element is greater than the second then the integer is greater than zero. If the first element is less than the second then the integer is less than zero. If the first element is equal to the second then the integer is zero.

See if you can figure out why each of these comparisons can give bad results.

Comparison #1: Putting on my top hat:

enum Clothes 
{ 
  Hat, 
  Tie, 
  Socks, 
  Pocketwatch, 
  Vest, 
  Shirt, 
  Shoes, 
  Cufflinks, 
  Gloves, 
  Tailcoat, 
  Underpants, 
  Trousers 
}
static int Compare(Clothes x, Clothes y) 
{ 
  const int xGreater = 1;   
  const int yGreater = -1;
  // If x has to go on after y then x is the greater   
  // If y has to go on after x then y is the greater   
  // Otherwise, they are equal
  switch (x)   
  { 
  case Clothes.Tie: 
    if (y == Clothes.Shirt) return xGreater; 
    break; 
  case Clothes.Socks: 
    if (y == Clothes.Shoes) return yGreater; 
    break; 
  case Clothes.Pocketwatch: 
    if (y == Clothes.Shirt || y == Clothes.Vest) return xGreater; 
    break; 
  case Clothes.Vest: 
    if (y == Clothes.Shirt) return xGreater; 
    if (y == Clothes.Tailcoat || y == Clothes.Pocketwatch) return yGreater; 
    break; 
  case Clothes.Shirt: 
    if (y == Clothes.Tie || y == Clothes.Pocketwatch || 
      y == Clothes.Vest || y == Clothes.Cufflinks || y == Clothes.Tailcoat) 
      return yGreater; 
    break; 
  case Clothes.Shoes: 
    if (y == Clothes.Trousers || y == Clothes.Socks || y == Clothes.Underpants) 
      return xGreater; 
  break; 
  case Clothes.Cufflinks: 
    if (y == Clothes.Shirt) return xGreater; 
    break; 
  case Clothes.Tailcoat: 
    if (y == Clothes.Vest || y == Clothes.Shirt) return xGreater; 
    break; 
  case Clothes.Underpants: 
    if (y == Clothes.Trousers || y == Clothes.Shoes) return yGreater; 
    break; 
  case Clothes.Trousers: 
    if (y == Clothes.Underpants) return xGreater; 
    if (y == Clothes.Shoes) return yGreater; 
    break;   
  } 
  return 0; 
}

OK, before you read on, can you figure out what the defect here is? It seems perfectly straightforward: if two things have to be ordered then they are ordered, and if not, then we say we don’t care by setting them equal.

Does that work?

If you actually try it out on a real list of randomly ordered clothes, you’ll find that much of the time this sorts the list into a senseless order, not into an order that preserves nice properties like shoes going on after trousers. Why?

An undocumented but extremely important assumption of the sorting algorithm is that the comparison function provides a consistent, total order. Part of providing a total order means that the comparison must preserve the invariant that things equal to the same are equal to each other. This idea that if you don’t care about the order of two things then you can call them equal is simply false. In our example Tie is equal to Hat and Hat is equal to Shirt, and therefore the sort algorithm is justified in believing that Tie should be equal to Shirt, but it isn’t.

Suppose, for example, the first element is Hat. A sort algorithm is perfectly justified in scanning the entire list, determining that everything is equal to the first element, and concluding that therefore it must already be sorted. Clearly a list where every element is equal to the first element is sorted! The actual implementation of QuickSort in the BCL is quite complex and has several clever tricks in it that help optimize the algorithm for common cases, such as subsets of the list already being in sorted order. If the comparison is not consistent then it is easy to fool those heuristics into doing the wrong thing. And in fact, some sort algorithms will go into infinite loops or crash horribly if you give them an incomplete comparison function.

The right algorithm for sorting a set with a partial order is topological sort; use the right tool for the job.

Next time on FAIC: we do the same thing, backwards

Fabulous adventures in coding

Eric Lippert's blog

Monthly Archives: January 2011

Bad comparisons, part four

Bad comparisons, part three

Bad comparisons, part two

Bad comparisons, part one