Bad comparisons, part two

Suppose I want to sort a bunch of strings into order first by length, and then, once they are sorted by length, sort each group that is the same length by some other comparison. We can easily build such a device with higher-order programming:

static Comparison<string> FirstByLength(Comparison<string> thenBy) 
{ 
  return (string x, string y) => 
  { 
    // Null strings are sorted before zero-length strings; remember, we need to provide a total ordering. 
    if (x == null && y == null) 
      return 0; 
    if (x == null) 
      return -1; 
    if (y == null) 
      return 1; 
    if (x.Length > y.Length) 
      return 1; 
    if (x.Length < y.Length) 
      return -1; 
    // They are the same length; sort on some other criterion.
     return thenBy(x, y); 
  }; 
}

Super. This idea of composing new comparison functions out of old ones is pretty neat. We can even built a reversal device:

static Comparison<string> Reverse(Comparison<string> comparison) 
{ 
  return (string x, string y) => -comparison(x, y); 
}

Something is subtly wrong in at least one of these comparison functions. Where’s the defect?

.

.

.

.

.

.

.

.

Let’s restate that contract again. The comparison function returns a negative integer if the first argument is smaller than the second, a positive integer if the first is greater than the second, and zero if they are equal. Any negative integer will do, and in particular, Int32.MinValue is a negative integer. Suppose we have a bizarre comparison function that returns Int32.MinValue instead of -1:

Comparison<string> bizarre = whatever;

and we compose it:

Comparison<string> reverseFirstByLength = Reverse(FirstByLength(bizarre));

Suppose two strings are equal in length and bizarre returns Int32.MinValue for those strings. Reverse should return a positive number, but -Int32.MinValue either throws an exception (in a checked context) or returns Int32.MinValue right back (in an unchecked context). Remember, there are more negative numbers that fit into an integer than positive numbers, by one.

The right implementation of Reverse is either to spell it out:

static Comparison<string> Reverse(Comparison<string> comparison)
{ 
  return (string x, string y) => 
  { 
    int result = comparison(x, y); 
    if (result > 0) return -1; 
    if (result < 0) return 1; 
    return 0; 
  }; 
}

Or to simply swap left and right:

static Comparison<string> Reverse(Comparison<string> comparison) 
{   
  return (string x, string y) => comparison(y, x); 
}

Next time on FAIC: Another way that comparisons go wrong.

Bad comparisons, part one

The mutable List<T> class provides an in-place sort method which can take a comparison delegate. It’s quite handy to be able to sort a list into order by being able to compare any two elements, but you have to make sure you get it right.

First off, what are the requirements of the comparison delegate? They are clearly documented: the comparison takes two elements and returns a 32 bit signed integer. If the first element is greater than the second then the integer is greater than zero. If the first element is less than the second then the integer is less than zero. If the first element is equal to the second then the integer is zero.

See if you can figure out why each of these comparisons can give bad results.

Comparison #1: Putting on my top hat:

enum Clothes 
{ 
  Hat, 
  Tie, 
  Socks, 
  Pocketwatch, 
  Vest, 
  Shirt, 
  Shoes, 
  Cufflinks, 
  Gloves, 
  Tailcoat, 
  Underpants, 
  Trousers 
}
static int Compare(Clothes x, Clothes y) 
{ 
  const int xGreater = 1;   
  const int yGreater = -1;
  // If x has to go on after y then x is the greater   
  // If y has to go on after x then y is the greater   
  // Otherwise, they are equal
  switch (x)   
  { 
  case Clothes.Tie: 
    if (y == Clothes.Shirt) return xGreater; 
    break; 
  case Clothes.Socks: 
    if (y == Clothes.Shoes) return yGreater; 
    break; 
  case Clothes.Pocketwatch: 
    if (y == Clothes.Shirt || y == Clothes.Vest) return xGreater; 
    break; 
  case Clothes.Vest: 
    if (y == Clothes.Shirt) return xGreater; 
    if (y == Clothes.Tailcoat || y == Clothes.Pocketwatch) return yGreater; 
    break; 
  case Clothes.Shirt: 
    if (y == Clothes.Tie || y == Clothes.Pocketwatch || 
      y == Clothes.Vest || y == Clothes.Cufflinks || y == Clothes.Tailcoat) 
      return yGreater; 
    break; 
  case Clothes.Shoes: 
    if (y == Clothes.Trousers || y == Clothes.Socks || y == Clothes.Underpants) 
      return xGreater; 
  break; 
  case Clothes.Cufflinks: 
    if (y == Clothes.Shirt) return xGreater; 
    break; 
  case Clothes.Tailcoat: 
    if (y == Clothes.Vest || y == Clothes.Shirt) return xGreater; 
    break; 
  case Clothes.Underpants: 
    if (y == Clothes.Trousers || y == Clothes.Shoes) return yGreater; 
    break; 
  case Clothes.Trousers: 
    if (y == Clothes.Underpants) return xGreater; 
    if (y == Clothes.Shoes) return yGreater; 
    break;   
  } 
  return 0; 
}

OK, before you read on, can you figure out what the defect here is? It seems perfectly straightforward: if two things have to be ordered then they are ordered, and if not, then we say we don’t care by setting them equal.

Does that work?

.

.

.

.

.

.

.

If you actually try it out on a real list of randomly ordered clothes, you’ll find that much of the time this sorts the list into a senseless order, not into an order that preserves nice properties like shoes going on after trousers. Why?

An undocumented but extremely important assumption of the sorting algorithm is that the comparison function provides a consistent, total order. Part of providing a total order means that the comparison must preserve the invariant that things equal to the same are equal to each other. This idea that if you don’t care about the order of two things then you can call them equal is simply false. In our example Tie is equal to Hat and Hat is equal to Shirt, and therefore the sort algorithm is justified in believing that Tie should be equal to Shirt, but it isn’t.

Suppose, for example, the first element is Hat. A sort algorithm is perfectly justified in scanning the entire list, determining that everything is equal to the first element, and concluding that therefore it must already be sorted. Clearly a list where every element is equal to the first element is sorted! The actual implementation of QuickSort in the BCL is quite complex and has several clever tricks in it that help optimize the algorithm for common cases, such as subsets of the list already being in sorted order. If the comparison is not consistent then it is easy to fool those heuristics into doing the wrong thing. And in fact, some sort algorithms will go into infinite loops or crash horribly if you give them an incomplete comparison function.

The right algorithm for sorting a set with a partial order is topological sort; use the right tool for the job.

Next time on FAIC: we do the same thing, backwards

Why are anonymous types generic?

Suppose you use an anonymous type in C#:

var x = new { A = "hello", B = 123.456 };

Ever taken a look at what code is generated for that thing? If you crack open the assembly with ILDASM or some other tool, you’ll see this mess in the top-level type definitions

.class '<>f__AnonymousType0`2'<'<A>j__TPar','<B>j__TPar'>

What the heck? Let’s clean that up a bit. We’ve mangled the names so that you are guaranteed that you cannot possibly accidentally use this thing “as is” from C#. Turning the mangled names back into regular names, and giving you the declaration and some of the body of the class in C#, that would look like:

[CompilerGenerated]
internal sealed class Anon0<TA, TB>
{
  private readonly TA a;
  private readonly TB b;
  public TA A { get { return this.a; } }
  public TB B { get { return this.b; } }
  public Anon0(TA a, TB b)
  { this.a = a; this.b = b; }
  // plus implementations of Equals, GetHashCode and ToString
}

And then at the usage site, that is compiled as:

var x = new Anon0<string, double>("hello", 123.456);

Again, what the heck? Why isn’t this generated as something perfectly straightforward, like:

[CompilerGenerated]
internal sealed class Anon0
{
  private readonly string a;
  private readonly double b;
  public string A { get { return this.a; } }
  public double B { get { return this.b; } }
  public Anon0(string a, double b)
  { this.a = a; this.b = b; }
  // plus implementations of Equals, GetHashCode and ToString
}

Good question. Consider the following. Suppose you have a library assembly, not written by you, that contains the following types:

public class B
{
  protected class P {}
}

Now, in your source code you have:

class D1 : B
{
  void M() 
  { 
    var x = new { P = new B.P() }; 
  }
}

class D2 : B
{
  void M() 
  { 
    var x = new { P = new B.P() }; 
  }
}

We need to generate an anonymous type, or types, somewhere. Suppose we decide that we want the two anonymous types – which have the same types and the same property names – to unify into one type. (We desire anonymous types that are structurally identical to unify within an assembly because that enables scenarios where multiple methods use generic type inference to infer the same anonymous type; you want to be able to pass instances of that anonymous type around between such methods. Perhaps I’ll do an example of that in the new year.)

Where do we generate that type? How about inside D1:

class D1 : B
{
  [CompilerGenerated]
  ??? sealed class Anon0 { public P P { get { ... } } ... }
  void M() 
  { 
    var x = new { P = new B.P() }; 
  }
}

What is the desired accessibilty of Anon0 in the location marked with question marks? It cannot be private or protected, because then D2 cannot see it. It cannot be either public or internal, because then you’d have a public/internal type with a public property that exposes a protected type, which is illegal. Nor can it be either “protected and internal” or “protected or internal” by similar logic. It cannot have any accessibility! Therefore the anonymous type cannot go in D1.  Obviously by identical logic it cannot go in D2. It cannot go in B because you don’t have the ability to modify the assembly containing class B. The only remaining place it can go is in the global namespace. But at the top level an internal type cannot refer to P, a protected nested type. P is only accessible inside a derived class of B but we’ve already ruled all those out.

But we can put the anonymous type at the top level if it never actually refers to P. If we make generic class Anon0<TP> and construct it with P for TP, then P only ever appears inside D1 and D2, and yet the types unify as desired.

Rather than coming up with some weird heuristic that determined when anonymous types needed to be generic, and making them normally typed otherwise, we simply decided to embrace the general solution. Anonymous types are always generated as generic types even when doing so is not strictly necessary. We did extensive performance testing to ensure that this choice did not adversely impact realistic scenarios, and as it turned out, the CLR is really quite buff when it comes to construction of generic types with lots of type parameters.

And with that, I’m off for the rest of the year. Air travel is too expensive this year, so I’m going to miss my traditional family Boxing Day celebration, but I’m sure it’ll be delightful to spend some time in Seattle for the holidays. I hope you all have a safe and festive holiday season, and we’ll see you for more fabulous adventures in 2011.

Debunking another myth about value types

Here’s another myth about value types that I sometimes hear:

“Obviously, using the new operator on a reference type allocates memory on the heap. But a value type is called a value type because it stores its own value, not a reference to its value. Therefore, using the new operator on a value type allocates no additional memory. Rather, the memory already allocated for the value is used.”

That seems plausible, right? Suppose you have an assignment to, say, a field s of type S:
Continue reading

Continuing to an outer loop

When you have a nested loop, sometimes you want to “continue” the outer loop, not the inner loop. For example, here we have a sequence of criteria and a sequence of items, and we wish to determine if there is any item which matches every criterion:

match = null;
foreach(var item in items)
{
  foreach(var criterion in criteria)
  {
    if (!criterion.IsMetBy(item))
    {
      // No point in checking anything further; this is not
      // a match. We want to “continue” the outer loop. How?
    }
  }
  match = item;
  break;
}

There are many ways to achieve this. Here are a few, in order of increasing awesomeness:

Technique #1 (ugly): When you have a loop, the continue statement is essentially a goto which branches to the bottom of the loop, just before it does the loop condition test and pops back up to the top of the loop. (Apparently when you spell goto as continue, the “all gotos are all evil all the time” crowd doesn’t bother you as much.) You can of course simply make this explicit:

match = null;
foreach(var item in items)
  {
  foreach(var criterion in criteria)
  {
    if (!criterion.IsMetBy(item))
    {
      goto OUTERCONTINUE;
    }
  }
  match = item;
  break;
  OUTERCONTINUE:
  ;
}

Technique #2 (better): When I see a nested loop, I almost always consider refactoring the interior loop into a method of its own.

match = null;
foreach(var item in items)
{
  if (MeetsAllCriteria(item, critiera))
  {
    match = item;
    break;
  }
}

where the body of MeetsAllCriteria is obviously

foreach(var criterion in criteria)
{
  if (!criterion.IsMetBy(item))
  {
    return false;
  }
}
return true;

Technique #3 (awesome): The “mechanism” code here is emphasizing how the code works and hiding the meaning of the code. The purpose of the code is to answer the question “what is the first item, if any, that meets all the criteria?” If that’s the meaning of the code, then write code that reads more like that meaning:

var matches = from item in items
              where criteria.All(criterion=>criterion.IsMetBy(item))
              select item;
match = matches.FirstOrDefault();

That is, search the items for items that meet every criterion, and take the first of them, if there is one. The code now reads like what it is supposed to be implementing, and of course, if there are no loops in the first place then there’s no need to “continue” at all!

The thing I love most about LINQ is that it enables a whole new way to think about problem solving in C#. It’s gotten to the point now that whenever I write a loop, I stop and think about two things before I actually type “for”:

  • Have I described anywhere what this code is doing semantically, or does the code emphasize more what mechanisms I am using at the expense of obscuring the semantics?
  • Is there any way to characterize the solution to my problem as the result of a query against a data source, rather than as the result of a set of steps to be followed procedurally?

Closing over the loop variable considered harmful, part two

(This is part two of a two-part series on the loop-variable-closure problem. Part one is here.)


UPDATE: We are taking the breaking change. In C# 5, the loop variable of a foreach will be logically inside the loop, and therefore closures will close over a fresh copy of the variable each time. The for loop will not be changed. We return you now to our original article.


Thanks to everyone who left thoughtful and insightful comments on last week’s post.

More countries really ought to implement Instant-Runoff Voting; it would certainly appeal to the geek crowd. Many people left complex opinions of the form “I’d prefer to make the change, but if you can’t do that then make it a warning”. Or “don’t make the change, do make it a warning”, and so on. But what I can deduce from reading the comments is that there is a general lack of consensus on what the right thing to do here is. In fact, I just did a quick tally:

  • Commenters who expressed support for a warning: 26
  • Commenters who expressed the sentiment “it’s better to not make the change”: 24
  • Commenters who expressed the sentiment “it’s better to make the change”: 25

Wow. I guess we’ll flip a coin. :-)[1. Mr. Smiley Face indicates that Eric is indulging in humourous japery.]

Continue reading

Closing over the loop variable considered harmful, part one

UPDATE: We are taking the breaking change. In C# 5, the loop variable of a foreach will be logically inside the loop, and therefore closures will close over a fresh copy of the variable each time. The for loop will not be changed. We return you now to our original article.


I don’t know why I haven’t blogged about this one before; this is the single most common incorrect bug report we get. That is, someone thinks they have found a bug in the compiler, but in fact the compiler is correct and their code is wrong. That’s a terrible situation for everyone; we very much wish to design a language which does not have “gotcha” features like this.

But I’m getting ahead of myself. Continue reading

Simple names are not so simple, part two

Today, the solution to the puzzle from last time. The code is correct, and compiles without issue. I was quite surprised when I first learned that; it certainly looks like it violates our rule about not using the same simple name to mean two different things in one block.

The key is to understanding why this is legal is that the query comprehensions and foreach loops are specified as syntactic sugars for another program, and it is that program which is actually analyzed for correctness. Our original program:

static void Main()
{
  int[] data = { 1, 2, 3, 1, 2, 1 };
    foreach (var m in from m in data orderby m select m)
      System.Console.Write(m);
}

is transformed into

static void Main()
{
  int[] data = { 1, 2, 3, 1, 2, 1 };
  {
    IEnumerator<int> e = 
      ((IEnumerable<int>)(data.OrderBy(m=>m)).GetEnumerator();
    try
    {
      int m; // Inside the "while" in C# 5 and above, outside in C# 1 through 4.
      while(e.MoveNext())
      {
        m = (int)(int)e.Current;
        Console.Write(m);
      }
    }
    finally
    {
      if (e != null) ((IDisposable)e).Dispose();
    }
  }
}

There are five usages of m in this transformed program; m is:

  1. declared as the formal parameter of a lambda.
  2. used in the body of the lambda; here it refers to the formal parameter.
  3. declared as a local variable
  4. written to in the loop; here it refers to the local variable
  5. read from in the loop; here it refers to the local variable

Is there any usage of a local variable before its declaration? No.

Are there any two declarations that have the same name in the same declaration space? It would appear so. The body of Main defines a local variable declaration space, and clearly the body of Main contains, indirectly, two declarations for m, one as a formal parameter and one as a local. But I said last time: local variable declaration spaces have special rules for determining overlaps. It is illegal for a local variable declaration space to directly contain a declaration such that another nested local variable declaration space contains a declaration of the same name. But an outer declaration space which indirectly contains two such declarations is not  an error. So in this case, no, there are no local variable declarations spaces which directly contain a declaration for m, such that a nested local variable declaration space also directly contains a declaration for m. Our two local variable declarations spaces which directly contain a declaration for m do not overlap anywhere.

Is there any declaration space which contains two inconsistent usages of the simple name m? Yes, again, the outer block of Main contains two inconsistent usages of m. But again, this is not relevant. The question is whether any declaration spaces directly containing a usage of m have an inconsistent usage. Again, we have two declaration spaces but they do not overlap each other, so there’s no problem here either.

The thing which makes this legal, interestingly enough, is the generation of the loop variable declaration logically within the try block. Were it to be generated outside the try block then this would be a violation of the rule about inconsistent usage of a simple name throughout a declaration space.

Simple names are not so simple, part one

C# has many rules that are designed to prevent some common sources of bugs and encourage good programming practices. So many, in fact, that it is often quite confusing to sort out exactly which rule has been violated. I thought I might spend some time talking about what the different rules are. We’ll finish up with a puzzle.

To begin with, it will be vital to understand the difference between scope and declaration space. To refresh your memory of my earlier article: the scope of an entity is the region of text in which that entity may be referred to by its unqualified name. A declaration space is a region of text in which no two things may have the same name (with an exception for methods which differ by signature.) A “local variable declaration space” is a particular kind of declaration space used for declaring local variables; local variable declaration spaces have special rules for determining when they overlap.

The next thing that you have to understand to make any sense o this is what a “simple name” is. A simple name is always either just a plain identifier, like x, or, in some cases, a plain identifier followed by a type argument list, like Frob<int, string>.

Lots of things are treated as “simple names” by the compiler: local variable declarations, lambda parameters, and so on, always have the first form of simple name in their declarations. When you say Console.WriteLine(x); the Console and x are simple names but the WriteLine is not. Confusingly, there are some textual entities which have the form of simple names, but are not treated as simple names by the compiler. We might talk about some of those situations in later fabulous adventures.

So, without further ado, here are some relevant rules which are frequently confused. It’s rules 3 and 4 that people find particularly confusing.

  1. It is illegal to refer to a local variable before its declaration. (This seems reasonable I hope.)
  2. It is illegal to have two local variables of the same name in the same local variable declaration space or nested local variable declaration spaces.
  3. Local variables are in scope throughout the entire block in which the declaration occurs. This is in contrast with C++, in which local variables are in scope in their block only at points after the declaration.
  4. For every occurrence of a simple name, whether in a declaration or as part of an expression, all uses of that simple name within the immediately enclosing local variable declaration space must refer to the same entity.

The purpose of all of these rules is to prevent the class of bugs in which the reader/maintainer of the code is tricked into believing they are referring to one entity with a simple name, but are in fact accidentally referring to another entity entirely. These rules are in particular designed to prevent nasty surprises when performing what ought to be safe refactorings.

Consider a world in which we did not have rules 3 and 4. In that world, this code would be legal:

class C
{
  int x;
  void M()
  {
    // 100 lines of code
    x = 20; // means "this.x";
    Console.WriteLine(x); // means "this.x"
    // 100 lines of code
    int x = 10;
    Console.WriteLine(x); // means "local x"
  }
}

This is hard on the person reading the code, who has a reasonable expectation that the two Console.WriteLine(x) lines do in fact both print out the contents of the same variable. But it is particularly nasty for the maintenance programmer who wishes to impose a reasonable coding standard upon this body of code. “Local variables are declared at the top of the block where they’re used” is a reasonable coding standard in a lot of shops. But changing the code to:

class C
{
  int x;
  void M()
  {
    int x;
    // 100 lines of code
    x = 20; // no longer means "this.x";
    Console.WriteLine(x); // no longer means "this.x"
    // 100 lines of code
    x = 10;
    Console.WriteLine(x); // means "local x"
  }
}

changes the meaning of the code! We wish to discourage authoring of multi-hundred-line methods, but making it harder and more error-prone to refactor them into something cleaner is not a good way to achieve that goal.

Notice that the original version of this program, rule 3 means that the program violates rule 1 — the first usage of x is treated as a reference to the local before it is declared. The fact that it violates rule 1 because of rule 3 is precisely what prevents it from being a violation of rule 4! The meaning of x is consistent throughout the block; it always means the local, and therefore is sometimes used before it is declared. If we scrapped rule 3 then this would be a violation of rule 4, because we would then have two inconsistent meanings for the simple name x within one block.

Now, these rules do not mean that you can refactor willy-nilly. We can still construct situations in which similar refactorings fail. For example:

class C
{
  int x;
  void M()
  {
    {
      // 100 lines of code
      x = 20; // means "this.x";
      Console.WriteLine(x); // means "this.x"
    }
    {
      // 100 lines of code
      int x = 10;
      Console.WriteLine(x); // means "local x"
    }
  }
}

This is perfectly legal. We have the same simple name being used two different ways in two different blocks, but the immediately enclosing block of each usage does not overlap that of any other usage. The local variable is in scope throughout its immediately enclosing block, but that block does not overlap the block above. In this case, it is safe to refactor the declaration of the local to the top of its block, but not safe to refactor the declaration to the top of the outermost block; that would change the meaning of x in the first block. Moving a declaration up is almost always a safe thing to do; moving it out is not necessarily safe.

Now that you know all that, here’s a puzzle for you, a puzzle that I got completely wrong the first time I saw it:

using System.Linq;
class Program
{
  static void Main()
  {
    int[] data = { 1, 2, 3, 1, 2, 1 };
    foreach (var m in from m in data orderby m select m)
      System.Console.Write(m);
  }
}

It certainly looks like simple name m is being used multiple times to mean different things. Is this program legal? If yes, why do the rules for not re-using simple names not apply? If no, precisely what rule has been violated?

Next time on FAIC: The solution to this puzzle.