Curiouser and curiouser

Here’s a pattern you see all the time in C#:

class Frob : IComparable<Frob>

At first glance you might ask yourself why this is not a “circular” definition; after all, you’re not allowed to say class Frob : Frob(*). However, upon deeper reflection that makes perfect sense; a Frob is something that can be compared to another Frob. There’s not actually a real circularity there.

This pattern can be genericized further:

class SortedList<T> where T : IComparable<T>

Again, it might seem a bit circular to say that T is constrained to something that is in terms of T, but actually this is just the same as before. T is constrained to be something that can be compared to T. Frob is a legal type argument for a SortedList because one Frob can be compared to another Frob.
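To see why there is no real circularity, here is a minimal sketch of a Frob (the Size property is invented purely for illustration); one Frob compares itself to another Frob, and that makes it a legal type argument for the SortedList:

class Frob : IComparable<Frob>
{
  public int Size { get; set; }
  public int CompareTo(Frob other)
  {
    if (other == null) return 1; // by convention, any value compares greater than null
    return this.Size.CompareTo(other.Size);
  }
}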

But this really hurts my brain:

class Blah<T> where T : Blah<T>

That appears to be circular in (at least) two ways. Is this really legal?

Yes it is legal, and it does have some legitimate uses. I see this pattern rather a lot(**). However, I personally don’t like it and I discourage its use.

This is a C# variation on what’s called the Curiously Recurring Template Pattern in C++, and I will leave it to my betters to explain its uses in that language. Essentially the pattern in C# is an attempt to enforce the usage of the CRTP.

So, why would you want to do that, and why do I object to it?

One reason why people want to do this is to enforce a particular constraint in a type hierarchy. Suppose we have

abstract class Animal
{
  public abstract void MakeFriends(Animal animal);
}

But that means that a Cat can make friends with a Dog, and that would be a crisis of Biblical proportions! (***) What we want to say is

abstract class Animal
{
  public abstract void MakeFriends(THISTYPE animal);
}

so that when Cat overrides MakeFriends, it can only override it with Cat.

Now, that immediately presents a problem in that we’ve just violated the Liskov Substitution Principle. We can no longer call a method on a variable of the abstract base type and have any confidence that type safety is maintained. Variance on formal parameter types has to be contravariance, not covariance, for it to be typesafe. And moreover, we simply don’t have that feature in the CLR type system.

But you can get close with the curious pattern:

abstract class Animal<T> where T : Animal<T>
{
  public abstract void MakeFriends(T animal);
}
class Cat : Animal<Cat>
{
  public override void MakeFriends(Cat cat) {}
}

and hey, we haven’t violated the LSP and we have guaranteed that a Cat can only make friends with a Cat. Beautiful.

Wait a minute… did we really guarantee that?

class EvilDog : Animal<Cat>
{
  public override void MakeFriends(Cat cat) { }
}

We have not guaranteed that a Cat can only make friends with a Cat; an EvilDog can make friends with a Cat too. The constraint only enforces that the type argument to Animal be good; how you use the resulting valid type is entirely up to you. You can use it for a base type of something else if you wish.

So that’s one good reason to avoid this pattern: because it doesn’t actually enforce the constraint you think it does. Everyone has to play along and agree that they’ll use the curiously recurring pattern the way it was intended to be used, rather than the evil dog way that it can be used.

The second reason to avoid this is simply because it bakes the noodle of anyone who reads the code. When I see List<Giraffe> I have a very clear idea of what the relationship is between the List<> part — it means that there are going to be operations that add and remove things — and the Giraffe part — those operations are going to be on giraffes. When I see FuturesContract<T> where T : LegalPolicy I understand that this type is intended to model a legal contract about a transaction in the future which has some particular controlling legal policy. But when I read Blah<T> where T : Blah<T> I have no intuitive idea of what the intended relationship is between Blah<T> and any particular T. It seems like an abuse of a mechanism rather than the modeling of a concept from the program’s “business domain”.

All that said, in practice there are times when using this pattern really does pragmatically solve problems in ways that are hard to model otherwise in C#; it allows you to do a bit of an end-run around the fact that we don’t have covariant return types on virtual methods, and other shortcomings of the type system. That it does so in a manner that does not, strictly speaking, enforce every constraint you might like is unfortunate, but in realistic code, usually not a problem that prevents shipping the product.
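For instance, here is a minimal sketch (hypothetical builder types, not from the original post) of that end-run: each fluent method on the base returns T, so a derived builder gets its own type back from the inherited methods.

abstract class Builder<T> where T : Builder<T>
{
  public string Name { get; private set; }
  // Returning T instead of Builder<T> simulates a covariant return type.
  public T WithName(string name) { this.Name = name; return (T)this; }
}

class HouseBuilder : Builder<HouseBuilder>
{
  public bool HasGarage { get; private set; }
  public HouseBuilder WithGarage() { this.HasGarage = true; return this; }
}

// new HouseBuilder().WithName("cottage").WithGarage() compiles because WithName returns HouseBuilder.
// Note the (T)this cast: it throws at runtime for an "evil dog" that supplies some other type argument.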

My advice is to think very hard before you implement this sort of curious pattern in C#; do the benefits to the customer really outweigh the costs associated with the mental burden you’re placing on the code maintainers?


(*) Due to an unintentional omission, some past editions of the C# specification actually did not say that this was illegal! However, the compiler has always enforced it. In fact, the compiler has over-enforced it, sometimes accidentally catching non-cycles and marking them as cycles.

(**) Most frequently in emails asking “is this really legal?”

(***) Mass hysteria!

Bad comparisons, part four

One more easy one. I want to “sort” a list into a random, shuffled order. I can do that by simply randomizing whether any two elements are greater than, less than, or equal to each other:

myList.Sort((x, y) => (new Random()).Next(-1, 2));

That generates a random -1, 0 or 1 for every comparison, right? So it will sort the list into random order, right?

.

.

.

.

.

.

.

There are multiple defects here. First off, clearly this violates all our rules for comparison functions. It does not produce a total ordering, and in fact it can tell you that two particular elements are equal and then later tell you that they have become unequal. The sort algorithm is allowed to go into infinite loops or crash horribly when given such an ill-behaved comparison function. And in fact, some implementations of sorting algorithms attempt to detect this error and will throw an exception if they determine that the comparison is inconsistent over time.

Second, every time this thing is called it creates a new Random instance seeded to the current time, and therefore it produces the same result over and over again if called multiple times in the same timeslice; hardly random.

Shuffling is not sorting; it is the opposite of sorting, so don’t use a sort algorithm to shuffle. There are lots of efficient shuffle algorithms that are easy to implement. (That said, it is legal to shuffle by sorting a list ordered by a randomly chosen key. But the key must be chosen exactly once for each item in the list and there must be a correct comparison function on that key.)
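For reference, a minimal sketch of a Fisher-Yates shuffle (a hypothetical helper, not from the original post), plus the sort-by-random-key approach done correctly:

static void Shuffle<T>(IList<T> list, Random random)
{
  // Walk backwards, swapping each position with a randomly chosen position at or before it.
  for (int i = list.Count - 1; i > 0; i -= 1)
  {
    int j = random.Next(i + 1); // 0 <= j <= i
    T temp = list[i];
    list[i] = list[j];
    list[j] = temp;
  }
}

// Or choose a random key exactly once per element and sort on that key:
// var shuffled = myList.OrderBy(item => random.NextDouble()).ToList();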

Bad comparisons, part three

Did you notice how last time my length comparison on strings was unnecessarily verbose? I could have written it like this:

static int ByLength(string x, string y)
{ 
  if (x == null && y == null) return 0;
  if (x == null) return -1; 
  if (y == null) return 1; 
  return CompareInts(x.Length, y.Length); 
}

static int CompareInts(int x, int y) { 
  // positive if x is larger
  // negative if y is larger
  //  zero if equal   
  return x - y; 
}

static Comparison<T> ThenBy<T>(this Comparison<T> firstBy, Comparison<T> thenBy) 
{ 
  return (x, y) => 
  { 
    int result = firstBy(x, y); 
    return result != 0 ? result : thenBy(x, y); 
  }; 
}

Much nicer! My string length comparison method is greatly simplified, I can compose comparisons easily with the ThenBy extension method, and I can reuse CompareInts as a helper function in other comparisons I might want to write.
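For example, a usage sketch (hypothetical; it assumes ThenBy is declared in a static class so it can be an extension method, and myList is a List<string>):

Comparison<string> byLengthThenOrdinal =
  new Comparison<string>(ByLength).ThenBy(string.CompareOrdinal);
myList.Sort(byLengthThenOrdinal); // by length first, then ordinally within each length group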

What’s the defect now?

.

.

.

.

.

.

This one should be easy after last time. The lengths of strings are always non-negative integers and in practice they never go above a few hundred million; the CLR does not allow you to allocate truly enormous multi-gigabyte strings. But though CompareInts is safe for inputs which are string lengths, it is not safe in general. In particular, for the inputs Int32.MinValue and Int32.MaxValue, the difference is 1. Clearly the smallest possible integer is smaller than the largest possible integer, but this method gives the opposite result! CompareInts should read:

static int CompareInts(int x, int y) 
{ 
  if (x > y) return 1; 
  if (x < y) return -1;
  return 0; 
}

The moral of the story here is that a comparison function that doesn’t compare something is probably wrong. Subtraction is not comparison.

Next time on FAIC: One more bad comparison.

Bad comparisons, part two

Suppose I want to sort a bunch of strings into order first by length, and then, once they are sorted by length, sort each group that is the same length by some other comparison. We can easily build such a device with higher-order programming:

static Comparison<string> FirstByLength(Comparison<string> thenBy) 
{ 
  return (string x, string y) => 
  { 
    // Null strings are sorted before zero-length strings; remember, we need to provide a total ordering. 
    if (x == null && y == null) 
      return 0; 
    if (x == null) 
      return -1; 
    if (y == null) 
      return 1; 
    if (x.Length > y.Length) 
      return 1; 
    if (x.Length < y.Length) 
      return -1; 
    // They are the same length; sort on some other criterion.
     return thenBy(x, y); 
  }; 
}

Super. This idea of composing new comparison functions out of old ones is pretty neat. We can even build a reversal device:

static Comparison<string> Reverse(Comparison<string> comparison) 
{ 
  return (string x, string y) => -comparison(x, y); 
}

Something is subtly wrong in at least one of these comparison functions. Where’s the defect?

.

.

.

.

.

.

.

.

Let’s restate that contract again. The comparison function returns a negative integer if the first argument is smaller than the second, a positive integer if the first is greater than the second, and zero if they are equal. Any negative integer will do, and in particular, Int32.MinValue is a negative integer. Suppose we have a bizarre comparison function that returns Int32.MinValue instead of -1:

Comparison<string> bizarre = whatever;

and we compose it:

Comparison<string> reverseFirstByLength = Reverse(FirstByLength(bizarre));

Suppose two strings are equal in length and bizarre returns Int32.MinValue for those strings. Reverse should return a positive number, but -Int32.MinValue either throws an exception (in a checked context) or returns Int32.MinValue right back (in an unchecked context). Remember, there are more negative numbers that fit into an integer than positive numbers, by one.
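A quick sketch of that wrap-around, in the default unchecked context:

int result = int.MinValue;                   // pretend this came back from the bizarre comparison
Console.WriteLine(-result);                  // -2147483648 again; the negation wrapped around
Console.WriteLine(-result == int.MinValue);  // True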

The right implementation of Reverse is either to spell it out:

static Comparison<string> Reverse(Comparison<string> comparison)
{ 
  return (string x, string y) => 
  { 
    int result = comparison(x, y); 
    if (result > 0) return -1; 
    if (result < 0) return 1; 
    return 0; 
  }; 
}

Or to simply swap left and right:

static Comparison<string> Reverse(Comparison<string> comparison) 
{   
  return (string x, string y) => comparison(y, x); 
}

Next time on FAIC: Another way that comparisons go wrong.

Bad comparisons, part one

The mutable List<T> class provides an in-place sort method which can take a comparison delegate. It’s quite handy to be able to sort a list into order by being able to compare any two elements, but you have to make sure you get it right.

First off, what are the requirements of the comparison delegate? They are clearly documented: the comparison takes two elements and returns a 32-bit signed integer. If the first element is greater than the second then the integer is greater than zero. If the first element is less than the second then the integer is less than zero. If the first element is equal to the second then the integer is zero.
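As a quick sketch of a well-behaved comparison that obeys that contract (a hypothetical list, not from the original post):

var words = new List<string> { "cherry", "apple", "banana" };
// string.CompareOrdinal returns a negative, zero or positive integer as the contract requires.
words.Sort((x, y) => string.CompareOrdinal(x, y));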

See if you can figure out why each of these comparisons can give bad results.

Comparison #1: Putting on my top hat:

enum Clothes 
{ 
  Hat, 
  Tie, 
  Socks, 
  Pocketwatch, 
  Vest, 
  Shirt, 
  Shoes, 
  Cufflinks, 
  Gloves, 
  Tailcoat, 
  Underpants, 
  Trousers 
}
static int Compare(Clothes x, Clothes y) 
{ 
  const int xGreater = 1;   
  const int yGreater = -1;
  // If x has to go on after y then x is the greater   
  // If y has to go on after x then y is the greater   
  // Otherwise, they are equal
  switch (x)   
  { 
  case Clothes.Tie: 
    if (y == Clothes.Shirt) return xGreater; 
    break; 
  case Clothes.Socks: 
    if (y == Clothes.Shoes) return yGreater; 
    break; 
  case Clothes.Pocketwatch: 
    if (y == Clothes.Shirt || y == Clothes.Vest) return xGreater; 
    break; 
  case Clothes.Vest: 
    if (y == Clothes.Shirt) return xGreater; 
    if (y == Clothes.Tailcoat || y == Clothes.Pocketwatch) return yGreater; 
    break; 
  case Clothes.Shirt: 
    if (y == Clothes.Tie || y == Clothes.Pocketwatch || 
      y == Clothes.Vest || y == Clothes.Cufflinks || y == Clothes.Tailcoat) 
      return yGreater; 
    break; 
  case Clothes.Shoes: 
    if (y == Clothes.Trousers || y == Clothes.Socks || y == Clothes.Underpants) 
      return xGreater; 
    break; 
  case Clothes.Cufflinks: 
    if (y == Clothes.Shirt) return xGreater; 
    break; 
  case Clothes.Tailcoat: 
    if (y == Clothes.Vest || y == Clothes.Shirt) return xGreater; 
    break; 
  case Clothes.Underpants: 
    if (y == Clothes.Trousers || y == Clothes.Shoes) return yGreater; 
    break; 
  case Clothes.Trousers: 
    if (y == Clothes.Underpants) return xGreater; 
    if (y == Clothes.Shoes) return yGreater; 
    break;   
  } 
  return 0; 
}

OK, before you read on, can you figure out what the defect here is? It seems perfectly straightforward: if two things have to be ordered then they are ordered, and if not, then we say we don’t care by setting them equal.

Does that work?

.

.

.

.

.

.

.

If you actually try it out on a real list of randomly ordered clothes, you’ll find that much of the time this sorts the list into a senseless order, not into an order that preserves nice properties like shoes going on after trousers. Why?

An undocumented but extremely important assumption of the sorting algorithm is that the comparison function provides a consistent, total order. Part of providing a total order means that the comparison must preserve the invariant that things equal to the same are equal to each other. This idea that if you don’t care about the order of two things then you can call them equal is simply false. In our example Tie is equal to Hat and Hat is equal to Shirt, and therefore the sort algorithm is justified in believing that Tie should be equal to Shirt, but it isn’t.

Suppose, for example, the first element is Hat. A sort algorithm is perfectly justified in scanning the entire list, determining that everything is equal to the first element, and concluding that therefore it must already be sorted. Clearly a list where every element is equal to the first element is sorted! The actual implementation of QuickSort in the BCL is quite complex and has several clever tricks in it that help optimize the algorithm for common cases, such as subsets of the list already being in sorted order. If the comparison is not consistent then it is easy to fool those heuristics into doing the wrong thing. And in fact, some sort algorithms will go into infinite loops or crash horribly if you give them an incomplete comparison function.

The right algorithm for sorting a set with a partial order is topological sort; use the right tool for the job.
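Here is a minimal sketch of what that might look like (a hypothetical helper, not from the original post); dependenciesOf returns, for each item, the items that must come before it:

static List<T> TopologicalSort<T>(IEnumerable<T> items, Func<T, IEnumerable<T>> dependenciesOf)
{
  var result = new List<T>();
  var state = new Dictionary<T, bool>(); // false = in progress, true = done
  void Visit(T item)
  {
    if (state.TryGetValue(item, out bool done))
    {
      if (!done) throw new InvalidOperationException("The partial order has a cycle.");
      return;
    }
    state[item] = false;
    foreach (T dependency in dependenciesOf(item))
      Visit(dependency); // everything this item depends on is emitted first
    state[item] = true;
    result.Add(item);
  }
  foreach (T item in items)
    Visit(item);
  return result;
}

For the clothes example, dependenciesOf(Clothes.Shoes) would yield Socks, Trousers and Underpants, and so on; the resulting list then respects the partial order.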

Next time on FAIC: we do the same thing, backwards

Why are anonymous types generic?

Suppose you use an anonymous type in C#:

var x = new { A = "hello", B = 123.456 };

Ever taken a look at what code is generated for that thing? If you crack open the assembly with ILDASM or some other tool, you’ll see this mess in the top-level type definitions:

.class '<>f__AnonymousType0`2'<'<A>j__TPar','<B>j__TPar'>

What the heck? Let’s clean that up a bit. We’ve mangled the names so that you are guaranteed that you cannot possibly accidentally use this thing “as is” from C#. Turning the mangled names back into regular names, and giving you the declaration and some of the body of the class in C#, that would look like:

[CompilerGenerated]
internal sealed class Anon0<TA, TB>
{
  private readonly TA a;
  private readonly TB b;
  public TA A { get { return this.a; } }
  public TB B { get { return this.b; } }
  public Anon0(TA a, TB b)
  { this.a = a; this.b = b; }
  // plus implementations of Equals, GetHashCode and ToString
}

And then at the usage site, that is compiled as:

var x = new Anon0<string, double>("hello", 123.456);

Again, what the heck? Why isn’t this generated as something perfectly straightforward, like:

[CompilerGenerated]
internal sealed class Anon0
{
  private readonly string a;
  private readonly double b;
  public string A { get { return this.a; } }
  public double B { get { return this.b; } }
  public Anon0(string a, double b)
  { this.a = a; this.b = b; }
  // plus implementations of Equals, GetHashCode and ToString
}

Good question. Consider the following. Suppose you have a library assembly, not written by you, that contains the following types:

public class B
{
  protected class P {}
}

Now, in your source code you have:

class D1 : B
{
  void M() 
  { 
    var x = new { P = new B.P() }; 
  }
}

class D2 : B
{
  void M() 
  { 
    var x = new { P = new B.P() }; 
  }
}

We need to generate an anonymous type, or types, somewhere. Suppose we decide that we want the two anonymous types – which have the same types and the same property names – to unify into one type. (We desire anonymous types that are structurally identical to unify within an assembly because that enables scenarios where multiple methods use generic type inference to infer the same anonymous type; you want to be able to pass instances of that anonymous type around between such methods. Perhaps I’ll do an example of that in the new year.)
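As a sketch of that unification payoff (my example, not the one promised in the post): because two structurally identical anonymous object creation expressions in one assembly produce the same type, generic type inference can flow instances of that type between methods.

static List<T> MakeList<T>(T first) { return new List<T> { first }; }

static void M()
{
  var list = MakeList(new { A = "hello", B = 123.456 });
  list.Add(new { A = "goodbye", B = 0.0 }); // compiles: same property names and types, so same anonymous type
}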

Where do we generate that type? How about inside D1:

class D1 : B
{
  [CompilerGenerated]
  ??? sealed class Anon0 { public P P { get { ... } } ... }
  void M() 
  { 
    var x = new { P = new B.P() }; 
  }
}

What is the desired accessibility of Anon0 in the location marked with question marks? It cannot be private or protected, because then D2 cannot see it. It cannot be either public or internal, because then you’d have a public/internal type with a public property that exposes a protected type, which is illegal. Nor can it be either “protected and internal” or “protected or internal” by similar logic. It cannot have any accessibility! Therefore the anonymous type cannot go in D1. Obviously by identical logic it cannot go in D2. It cannot go in B because you don’t have the ability to modify the assembly containing class B. The only remaining place it can go is in the global namespace. But at the top level an internal type cannot refer to P, a protected nested type. P is only accessible inside a derived class of B but we’ve already ruled all those out.

But we can put the anonymous type at the top level if it never actually refers to P. If we make a generic class Anon0<TP> and construct it with P for TP, then P only ever appears inside D1 and D2, and yet the types unify as desired.

Rather than coming up with some weird heuristic that determined when anonymous types needed to be generic, and making them normally typed otherwise, we simply decided to embrace the general solution. Anonymous types are always generated as generic types even when doing so is not strictly necessary. We did extensive performance testing to ensure that this choice did not adversely impact realistic scenarios, and as it turned out, the CLR is really quite buff when it comes to construction of generic types with lots of type parameters.

And with that, I’m off for the rest of the year. Air travel is too expensive this year, so I’m going to miss my traditional family Boxing Day celebration, but I’m sure it’ll be delightful to spend some time in Seattle for the holidays. I hope you all have a safe and festive holiday season, and we’ll see you for more fabulous adventures in 2011.

Debunking another myth about value types

Here’s another myth about value types that I sometimes hear:

“Obviously, using the new operator on a reference type allocates memory on the heap. But a value type is called a value type because it stores its own value, not a reference to its value. Therefore, using the new operator on a value type allocates no additional memory. Rather, the memory already allocated for the value is used.”

That seems plausible, right? Suppose you have an assignment to, say, a field s of type S:

No backtracking, part two

As I was saying last time, the nice thing about “no backtracking” is that it makes the language much easier to understand. Simple rules benefit both the compiler and the code reader; both are attempting to read the code to make sense of it. It is not always a good idea to take advantage of the compiler’s ability to search a large space if that makes it harder for a human to understand the code.

Suppose you have something like this mess: (*)

namespace XYZ.DEF
{
    public class GHI {}
}

namespace QRS.DEF.GHI
{
    public class JKL { }
}

… in another file …

using QRS;
namespace TUV
{
    using XYZ;
    namespace ABC
    {
        namespace DEF
        {
            class GHI { }
            class MNO : DEF.GHI.JKL { }
        }
    }
}

And now we must work out the base type of MNO.

With no backtracking we say “The nearest container of MNO that has a member DEF is ABC, therefore DEF means ABC.DEF”. Therefore GHI is ABC.DEF.GHI. Therefore JKL is ABC.DEF.GHI.JKL, which does not exist, therefore, give an error. The developer must fix the error by giving a type name that lets the compiler identify which DEF you meant.
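For example, a sketch of one way to fix it:

class MNO : global::QRS.DEF.GHI.JKL { } // fully qualified, so there is nothing for the compiler to guess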

If we had backtracking, what would we have to do? We’d get that error, and then we’d backtrack. Does XYZ contain a DEF? Yes. Does it contain a GHI? Yes. Does it contain a JKL? No. Backtrack again. Does QRS contain a DEF.GHI.JKL? Yes.

That works, but can we logically conclude from the fact that it works that it is the one the user meant?

Who the heck knows in this crazy situation? We got all kinds of good bindings in there that then went bad very late in the game. The idea that we stumbled upon the desired answer after going down so many blind alleys seems highly suspect. Maybe there is yet another choice in there that is the one the user meant. We cannot know that unless we try all of them, and again, that could involve a lot of searching.

The correct thing to do here is not to backtrack multiple times and try out all kinds of worse bindings for every stage of the lookup. The correct thing to do is to say “buddy, the best possible match for this lookup gives nonsensical results; give me something less ambiguous to work with here please.”

An unfortunate fact about writing a language where the compiler by design complains loudly if the best match is something that doesn’t work, is that developers frequently say “well, sure, in general I want the compiler to point out all my mistakes — or, rather, all my coworker’s mistakes. But for this specific case, I know what I am doing, so please, compiler, do what I mean, not what I say.”

Trouble is, you can’t have it both ways. You can’t have a compiler that both enforces rigid rules that make it highly likely that suspicious code will be aggressively identified as erroneous, and allows crazy code via compiler heuristics that figure out “what I really meant” when you write something that the compiler quite rightly sees as ambiguous or wrong.

I could discuss many more places where we could do backtracking but do not. Method type inference, for example, always either makes progress or fails; it never backtracks in C# (**). But I think I will leave it at that. Except to say that there is one place where we do use backtracking, and that’s analysis of overload resolution in nested lambdas. I wrote about that here.

(*) When reading this remember that in C#, “namespace XYZ.DEF { }” is a short form for “namespace XYZ{ namespace DEF { } }”

(**) Unlike in, say, F#.


Commentary from 2021:

There were a number of good comments to the original post, which pointed out:

  • Error analysis often has to perform some sort of backtracking to give a good error. In the examples I gave, for instance, we could run a backtracking analysis and then give a “did you mean…?” style error message if backtracking finds a solution, and a different error if no solution can be found. That’s a great idea in principle, but in practice we often end up in situations where backtracking is expensive, so we’d want to put a box around the amount of time and resources the error reporting system uses.
  • This is even more the case when IntelliSense is involved. IntelliSense by definition is typically operating on code that is wrong; if the code was right, why are you editing it? IntelliSense needs to give good options, but it needs to do so in under 30 milliseconds, so it uses lots of tricks to meet its time budget.
  • There was also some good discussion of the deeper language design point here. There are times when we want to use backtracking algorithms, and times we do not; what motivates that choice? There is not just one thing; like all design and implementation problems it comes down to resolving conflicts between incompatible principles. One of the nice things about “no backtracking” algorithms is that they often allow “local” analysis; if we see ABC.DEF.GHI and immediately after, ABC.DEF.XYZ, we know that no matter what, the ABC.DEF means the same thing; the meanings of ABC and ABC.DEF do not change depending on whether GHI and XYZ produce errors or not. That’s a nice property to have.

No backtracking, part one

A number of the articles I’ve published over the years involve “backtracking” algorithms; most recently my series on how to solve Sudoku puzzles (and more generally, all graph colouring problems) by backtracking. The idea of a backtracking algorithm is really simple and powerful: when faced with a choice, try every possibility. If all of them go wrong, backtrack to the previous choice and try a different possibility. Backtracking algorithms are essentially a depth-first search of the space of possible solutions.

But what if the solution space is large? There are more ways to fill in nine digits in 81 places than there are protons in the entire Universe; we can’t check them all. The reason that backtracking works so rapidly for Sudoku puzzles is because typically every guess eliminates many of the branches of the solution tree. If you guess that the first square is 1 then you do not need to explore the possibility that the second square is 1; you can just prune that branch and never explore it. Since that branch itself is vast, that’s a good strategy! (There are pathological cases for this algorithm, but in practice it is fast.)

Backtracking algorithms work well for many situations, but we have consciously eschewed them in the design of C#, with a few exceptions. For example, there is no backtracking that ever crosses “phases” of compilation in C#.(*) When we are presented with some source code the first thing we do is “lex” it; break it up into “tokens”. This is akin to finding the words in a sentence by looking at the spacing and punctuation. Then we do a grammatical analysis of the token stream, and then we do a semantic analysis of the grammatical analysis. When one of those phases finds an error, we never backtrack to a previous phase and ask for more analysis. Suppose for example you have this silly code:

x = j+++++1;

That is lexically correct C# code. The lexer is a greedy lexer; it tries to make the largest token it can at any given time. So it tokenizes this as:

x = j ++ ++ + 1 ;

Now the grammar analyzer takes that token stream and says “this is an expression involving two identifiers, one constant and four operators; how should I parenthesize that?” The grammar analyzer determines using the rules of precedence and associativity that this means

x=(((j++)++)+1);

So this means “evaluate x, then increment j, then increment whatever the previous result was, then add 1 to the previous result, then assign the previous result to x, done”

Then the semantic analyzer checks that to make sure that all those operations make sense. They do not, because the result of j++ is a value, not a variable, but the second ++ requires a variable. The semantic analyzer gives an error, and the program does not compile.

If we had a backtracking compiler, the semantic analyzer could tell the parser “no, you did something wrong, give me a different parse tree if there is one”.

Suppose the parser does that. It could say well, maybe that last + was not a binary operator but rather the unary operator + applied to 1, oh, but if we do that then we have two expressions next to each other with no intervening operator. Same thing if the second ++ applies to the +1 instead of the j++. So no, there is no other parse of this. So push back on the lexer and ask it to backtrack.

The lexer could then say, oh, sure, if you don’t like that then maybe my guess that the second ++ was ++ was wrong. Try this one:

x = j ++ + ++ 1 ;

and then the parser says OK, if that’s where the token breaks are then the grammatical analysis is:

x=((j++)+(++1));

and the semantic analyzer determines that’s wrong too, this time because you can’t have a ++ on a constant. So the semantic analyzer tells the parser, no, that’s not it either, and the parser tells the lexer, no, that’s not it either. And the lexer says “well, maybe that wasn’t a ++ at all the second time, maybe that should be

x = j ++ + + + 1 ;

And now the parser says “oh, I see what that is, it is two unary plus operators applied to 1 and an addition operator:

x=((j++)+(+(+1)));

and the semantic analyzer says “finally! something I can work with.”

That’s not what we do. This is a silly example, but perhaps you can see that there are two major problems with this strategy.

First, though most of the time lexing is unambiguous, in some pathological cases there are potentially enormously many ways that simple programs could be lexed. Just determining where the breaks could go between five plus signs gives eight possibilities; it grows exponentially from there. Since programs almost always lex unambiguously, and since the space to explore can be enormous when they do not, it’s better to always lex greedily and never backtrack. If a program fails grammatical analysis because of an unintended lexical structure then the best thing to do is tell the user and they’ll fix it.

The second major problem is that when backtracking, how do you know which backtracking is better? Of those eight possibilities, two give program fragments that pass semantic analysis: the one we just found, and x=(j+(+(+(+(+1))))); Is it at all clear which of the two possibilities is the one the user meant when they wrote this ridiculous statement? Remember, we’re not trying to solve a Sudoku puzzle here, where any given combination of numbers is as meaningless as the next; we are attempting to output IL that has the same meaning as the meaning intended by the user! One of these possibilities mutates j and the other does not, so it makes a difference. C# is not a “guess what you meant” language, it’s a “do what you said” language(**). If what you said was ambiguous, we should tell you about it.


Next time on FAIC: we’ll return to this point by looking at how the name resolution algorithm also does not backtrack.

(*) Unlike JavaScript, interestingly enough. The ECMAScript specification notes that there are lexical ambiguities involving the / character; it could be introducing a comment as // or as /*, or closing a comment as */, or could be a normal division operator, or a /= operator, or opening or closing a regular expression. Some of those overlap; the spec actually says that the correct lex is the one that produces a grammatically sensible program!

(**) Again, unlike JavaScript.


Commentary from 2021:

There were a lot of good comments (and some needlessly pedantic comments, imagine that) on the original article that went down a few ratholes mentioned above:

  • There were some points made about the difficulty of having lexical ambiguities involving the / symbol in JS that are exacerbated by automatic semicolon insertion
  • Some readers pointed out that sudoku puzzles are required by the rules of the puzzle to have a unique solution; though that is true, I was just using it as an example of a problem amenable to a backtracking algorithm; I was not intending to draw a deeper analogy between solving sudoku puzzles and building a compiler. (Fun fact though: it is possible to encode a sudoku puzzle into a C# program such that the compiler is required to find a solution or produce an error if the puzzle is ambiguous or unsolvable.)
  • Some readers asked about scenarios that involve a kind of backtracking in the parser, but not “across phases”. For example, consider: (T)-1 — does that parse as a subtraction or as a cast of -1? Or my favourite: A(B<C,D>(E)) which parses as two arguments to A in C# 1, and one in C# 2. In both these situations the parser is required to “look ahead” arbitrarily far in the token stream to determine how the parse works; most of the time the parser only has to look ahead one token to determine what the parse should be. A “look ahead” and a “backtrack” are kinda sorta the same thing. Similarly there are scenarios where it is not clear whether >> is a shift operator or the end of a nested generic type.

In 2016 I worked on rearchitecting the Hack parser; Hack is a gradually-typed variant of PHP used inside Facebook. One of the major difficulties I faced in writing a purely functional, full-fidelity Hack parser is that Hack has XML literals in the language, much like VB. The lexical and grammatical rules of the language are completely different inside these XML regions and I did end up having to build some cross-phase backtracking into the parser, which was quite vexing.

The Truth About Value Types

As you know if you’ve read this blog for a while, I’m disturbed by the myth that “value types go on the stack”. Unfortunately, there are plenty of examples in our own documentation and in many books that reinforce this myth, either subtly or overtly. I’m opposed to it because:

  1. It is usually stated incorrectly: the statement should be “value types can be stored on the stack”, instead of the more common “value types are always stored on the stack”.
  2. It is almost always irrelevant. We’ve worked hard to make a managed environment where the distinctions between different kinds of storage are hidden from the user, unlike some languages, in which you must know whether a particular storage location is on the stack or the heap for correctness reasons.
  3. It is incomplete. What about references? References are neither value types nor instances of reference types, but they are values. They’ve got to be stored somewhere. Do they go on the stack or the heap? Why does no one ever talk about them? Just because they don’t have a type in the C# type system is no reason to ignore them.

The way in the past I’ve usually pushed back on this myth is to say that the real statement should be “in the Microsoft implementation of C# on the desktop CLR, value types are stored on the stack when the value is a local variable or temporary that is not a closed-over local variable of a lambda or anonymous method, and the method body is not an iterator block, and the jitter chooses to not enregister the value.”

The sheer number of weasel words in there is astounding, but they’re all necessary:

  • Versions of C# provided by other vendors may choose other allocation strategies for their temporary variables; there is no language requirement that a data structure called “the stack” be used to store locals of value type.
  • We have many versions of the CLI that run on embedded systems, in web browsers, and so on. Some may run on exotic hardware. I have no idea what the memory allocation strategies of those versions of the CLI are. The hardware might not even have the concept of “the stack” for all I know. Or there could be multiple stacks per thread. Or everything could go on the heap.
  • Lambdas and anonymous methods hoist local variables to become heap-allocated fields; those are not on the stack anymore (see the sketch after this list).
  • Iterator blocks in today’s implementation of C# on the desktop CLR also hoist locals to become heap-allocated fields. They do not have to! We could have chosen to implement iterator blocks as coroutines running on a fiber with a dedicated stack. In that case, the locals of value type could go on the stack of the fiber.
  • People always seem to forget that there is more to memory management than “the stack” and “the heap”. Registers are neither on the stack nor the heap, and it is perfectly legal for a value type to go in a register if there is one of the right size. If it is important to know when something goes on the stack, then why isn’t it important to know when it goes in a register? Conversely, if the register scheduling algorithm of the jit compiler is unimportant for most users to understand, then why isn’t the stack allocation strategy also unimportant?
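To make the lambda point concrete, here is a minimal sketch (hypothetical code): total is a local of value type, but because the lambda captures it and the delegate outlives the method, the compiler hoists it into a heap-allocated closure class.

static Func<int> MakeCounter()
{
  int total = 0;           // a local of value type...
  return () => ++total;    // ...captured by a lambda, so it becomes a field of a heap-allocated closure
}

// Later:
Func<int> counter = MakeCounter();
Console.WriteLine(counter()); // 1
Console.WriteLine(counter()); // 2: total survived the activation of MakeCounter, so it cannot be on the stack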

Having made these points many times in the last few years, I’ve realized that the fundamental problem is in the mistaken belief that the type system has anything whatsoever to do with the storage allocation strategy. It is simply false that the choice of whether to use the stack or the heap has anything fundamentally to do with the type of the thing being stored. The truth is: the choice of allocation mechanism has to do only with the known required lifetime of the storage.

Once you look at it that way then everything suddenly starts making much more sense. Let’s break it down into some simple declarative sentences.

  • There are three kinds of values: (1) instances of value types, (2) instances of reference types, and (3) references. (Code in C# cannot manipulate instances of reference types directly; it always does so via a reference. In unsafe code, pointer types are treated like value types for the purposes of determining the storage requirements of their values.)
  • There exist “storage locations” which can store values.
  • Every value manipulated by a program is stored in some storage location.
  • Every reference (except the null reference) refers to a storage location.
  • Every storage location has a “lifetime”. That is, a period of time in which the storage location’s contents are valid.
  • The time between the start of execution of a particular method and the method returning normally or throwing an exception is the “activation period” of that method execution.
  • Code in a method can require the use of a storage location. If the required lifetime of the storage location is longer than the activation period of the current method execution then the storage is said to be “long lived”. Otherwise it is “short lived”. (Note that when method M calls method N, the use of the storage locations for the parameters passed to N and the value returned by N is required by M.)

Now we come to implementation details. In the Microsoft implementation of C# on the CLR:

  • There are three kinds of storage locations: stack locations, heap locations, and registers.
  • Long-lived storage locations are always heap locations.
  • Short-lived storage locations are always stack locations or registers.
  • There are some situations in which it is difficult for the compiler or runtime to determine whether a particular storage location is short-lived or long-lived. In those cases, the prudent decision is to treat them as long-lived. In particular, the storage locations of instances of reference types are always treated as though they are long-lived, even if they are provably short-lived. Therefore they always go on the heap.

And now things follow very naturally:

  • We see that references and instances of value types are essentially the same thing as far as their storage is concerned; they go on either the stack, in registers, or the heap depending on whether the storage of the value needs to be short-lived or long-lived.
  • It is frequently the case that array elements, fields of reference types, locals in an iterator block and closed-over locals of a lambda or anonymous method must live longer than the activation period of the method that first required the use of their storage. And even in the rare cases where their lifetimes are shorter than that of the activation of the method, it is difficult or impossible to write a compiler that knows that. Therefore we must be conservative: all of these storage locations go on the heap.
  • It is frequently the case that local variables and temporary values can be shown via compile-time analysis to be unused after the activation period ends, and therefore can be treated as short-lived, and therefore can go onto the stack or into registers.

Once you abandon entirely the crazy idea that the type of a value has anything whatsoever to do with the storage, it becomes much easier to reason about it. Of course, my point above stands: you don’t need to reason about it unless you are writing unsafe code or doing some sort of heavy interoperating with unmanaged code. Let the compiler and the runtime manage the lifetime of your storage locations; that’s what they’re good at.