Guidelines and rules for GetHashCode

“The code is more what you’d call guidelines than actual rules” – truer words were never spoken. It’s important when writing code to understand which are vague “guidelines” that should be followed but can be broken or fudged, and which are crisp “rules” that have serious negative consequences for correctness and robustness. I often get questions about the rules and guidelines for GetHashCode, so I thought I might summarize them here.

What is GetHashCode used for?

It is by design useful for only one thing: putting an object in a hash table. Hence the name.

Why do we have this method on Object in the first place?

It makes perfect sense that every object in the type system should provide a GetType method; data’s ability to describe itself is a key feature of the CLR type system. And it makes sense that every object should have a ToString, so that it is able to print out a representation of itself as a string, for debugging purposes. It seems plausible that objects should be able to compare themselves to other objects for equality. But why should it be the case that every object should be able to hash itself for insertion into a hash table? Seems like an odd thing to require every object to be able to do.

I think if we were redesigning the type system from scratch today, hashing might be done differently, perhaps with an IHashable interface. But when the CLR type system was designed there were no generic types and therefore a general-purpose hash table needed to be able to store any object.

How do hash tables and similar data structures use GetHashCode?

Consider a “set” abstract data type. Though there are many operations you might want to perform on a set, the two basic ones are insert a new item into the set, and check to see whether a given item is in the set. We would like these operations to be fast even if the set is large. Consider, for example, using a list as an implementation detail of a set:

class Set<T>
{
  private List<T> list = new List<T>();
  public void Insert(T item)
  {
    if (!Contains(item))
      list.Add(item);
  }
  public bool Contains(T item)
  {
    foreach(T member in list)
      if (member.Equals(item))
        return true;
    return false;
  }
}

(I’ve omitted any error checking throughout this article; we probably want to make sure that the item is not null. We probably want to implement some interfaces, and so on. I’m keeping things simple so that we concentrate on the hashing part.)

The test for containment here is linear; if there are ten thousand items in the list then we have to look at all ten thousand of them to determine that the object is not in the list. This does not scale well.

The trick is to trade a small amount of increased memory burden for a huge amount of increased speed. The idea is to make many shorter lists, called “buckets”, and then be clever about quickly working out which bucket we’re looking at:

class Set<T>
{
  private List<T>[] buckets = new List<T>[100];
  public void Insert(T item)
  {
    int bucket = GetBucket(item.GetHashCode());
    if (Contains(item, bucket))
      return;
    if (buckets[bucket] == null)
      buckets[bucket] = new List<T>();
    buckets[bucket].Add(item);
  }
  public bool Contains(T item)
  {
    return Contains(item, GetBucket(item.GetHashCode()));
  }
  private int GetBucket(int hashcode)
  {
    unchecked
    {
      // A hash code can be negative, and thus its remainder can be negative also.
      // Do the math in unsigned ints to be sure we stay positive.
      return (int)((uint)hashcode % (uint)buckets.Length);
    }
  }
  private bool Contains(T item, int bucket)
  {
    if (buckets[bucket] != null)
      foreach(T member in buckets[bucket])
        if (member.Equals(item))
          return true;
    return false;
  }
}

Now if we have ten thousand items in the set then we are looking through one of a hundred buckets, each with on average a hundred items; the Contains operation just got a hundred times cheaper.

On average.

We hope.

We could be even more clever here; just as a List<T> resizes itself when it gets full, the bucket set could resize itself as well, to ensure that the average bucket length stays low. Also, for technical reasons it is often a good idea to make the bucket set length a prime number, rather than 100. There are plenty of improvements we could make to this hash table. But this quick sketch of a naive implementation of a hash table will do for now. I want to keep it simple.

By starting with the position that this code should work, we can deduce what the rules and guidelines must be for GetHashCode:

Rule: equal items have equal hashes

If two objects are equal then they must have the same hash code; or, equivalently, if two objects have different hash codes then they must be unequal.

The reasoning here is straightforward. Suppose two objects were equal but had different hash codes. If you put the first object in the set then it might be put into bucket #12. If you then ask the set whether the second object is a member, it might search bucket #67, and not find it.

Note that it is not a rule that if two objects have the same hash code, then they must be equal. There are only four billion or so possible hash codes, but obviously there are more than four billion possible objects. There are far more than four billion ten-character strings alone. Therefore there must be at least two unequal objects that share the same hash code, by the Pigeonhole Principle: if you have four billion pigeon holes to hold more than four billion pigeons then at least one pigeonhole has two pigeons.
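For example, here is a minimal sketch of a type that obeys the rule by computing both Equals and GetHashCode from exactly the same fields (the type and the combining constant are my inventions):

struct Point
{
  public readonly int X;
  public readonly int Y;
  public Point(int x, int y) { X = x; Y = y; }

  public override bool Equals(object obj)
  {
    if (!(obj is Point)) return false;
    Point p = (Point)obj;
    return p.X == X && p.Y == Y;
  }

  // Computed from exactly the fields that Equals compares, so equal
  // points always produce equal hash codes.
  public override int GetHashCode()
  {
    unchecked { return X * 31 + Y; }
  }
}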

Guideline: the integer returned by GetHashCode should never change

Ideally, the hash code of a mutable object should be computed from only fields which cannot mutate, and therefore the hash value of an object is the same for its entire lifetime.

However, this is only an ideal-situation guideline; the actual rule is:

Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable

It is permissible, though dangerous, to make an object whose hash code value can mutate as the fields of the object mutate. If you have such an object and you put it in a hash table then the code which mutates the object and the code which maintains the hash table are required to have some agreed-upon protocol that ensures that the object is not mutated while it is in the hash table. What that protocol looks like is up to you.

If an object’s hash code can mutate while it is in the hash table then clearly the Contains method stops working. You put the object in bucket #5, you mutate it, and when you ask the set whether it contains the mutated object, it looks in bucket #74 and doesn’t find it.

Remember, objects can be put into hash tables in ways that you didn’t expect. A lot of the LINQ sequence operators use hash tables internally. Don’t go dangerously mutating objects while enumerating a LINQ query that returns them!
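Here is a tiny demonstration of how this goes wrong, using a deliberately bad mutable type; all of the names here are my inventions:

using System;
using System.Collections.Generic;

class MutablePoint
{
  public int X, Y;
  public override bool Equals(object obj)
  {
    MutablePoint p = obj as MutablePoint;
    return p != null && p.X == X && p.Y == Y;
  }
  public override int GetHashCode()
  {
    unchecked { return X * 31 + Y; }
  }
}

class Program
{
  static void Main()
  {
    var set = new HashSet<MutablePoint>();
    var p = new MutablePoint { X = 1, Y = 2 };
    set.Add(p);  // filed in the bucket for the current hash code
    p.X = 100;   // the hash code just changed...
    Console.WriteLine(set.Contains(p)); // ...almost certainly False
  }
}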

Rule: Consumers of GetHashCode cannot rely upon it being stable over time or across appdomains

Suppose you have a Customer object that has a bunch of fields like Name, Address, and so on. If you make two such objects with exactly the same data in two different processes, they do not have to return the same hash code. If you make such an object on Tuesday in one process, shut it down, and run the program again on Wednesday, the hash codes can be different.

This has bitten people in the past. The documentation for System.String.GetHashCode notes specifically that two identical strings can have different hash codes in different versions of the CLR, and in fact they do. Don’t store string hashes in databases and expect them to be the same forever, because they won’t be.

Rule: GetHashCode must never throw an exception, and must return

Getting a hash code simply calculates an integer; there’s no reason why it should ever fail. An implementation of GetHashCode should be able to handle any legal configuration of the object.

I occasionally get the response “but I want to throw NotImplementedException in my GetHashCode to ensure that the object is never put into a hash table; I don’t intend for this object to ever be put into a hash table.”  Well, OK, but the last sentence of the previous guideline applies; this means that your object cannot be a result in many LINQ-to-objects queries that use hash tables internally for performance reasons.

Since it doesn’t throw an exception, it has to return a value eventually. It’s not legal or smart to make an implementation of GetHashCode that goes into an infinite loop.

This is particularly important when hashing objects that might be recursively defined and contain circular references. If hashing object Alpha hashes the value of property Beta, and hashing Beta turns right around and hashes Alpha, then you’re going to either loop forever (if you’re on an architecture that can optimize tail calls) or run out of stack and crash the process.
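One defensive approach, sketched below with hypothetical types, is to hash only state that cannot participate in a cycle:

class Beta
{
  public Alpha Owner; // may refer back to an Alpha
}

class Alpha
{
  public int Id;
  public Beta Beta; // which may in turn refer back to us

  // Hash only local, non-recursive state; never recurse into Beta.
  // (The Equals implementation must of course agree with this choice.)
  public override int GetHashCode() => Id.GetHashCode();
}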

Guideline: the implementation of GetHashCode must be extremely fast

The whole point of GetHashCode is to optimize a lookup operation; if the call to GetHashCode is slower than looking through those ten thousand items one at a time, then you haven’t made a performance gain.

I classify this as a “guideline” and not a “rule” because it is so vague. How slow is too slow? That’s up to you to decide.

Guideline: the implementation of GetHashCode must be performant

Speed is only one kind of performance; consider a hash function that allocates a bunch of heap memory as it computes a hash code. (For instance, by concatenating strings and hashing the resulting string, rather than hashing the strings separately and combining the results.) That generates collection pressure, which makes garbage collections more likely, which may then slow down crucial code that runs later.

Guideline: the distribution of hash codes must be “random”

By a “random distribution” I mean that if there are commonalities in the objects being hashed, there should not be similar commonalities in the hash codes produced. Suppose for example you are hashing an object that represents the latitude and longitude of a point. A set of such locations is highly likely to be “clustered”; odds are good that your set of locations is, say, mostly houses in the same city, or mostly valves in the same oil field, or whatever. If clustered data produces clustered hash values then that might decrease the number of buckets used and cause a performance problem when the bucket gets really big.

Again, I list this as a guideline rather than a rule because it is somewhat vague, not because it is unimportant. It’s very important. But since good distribution and good speed can be opposites, it’s important to find a good balance between the two.

I know this from deep, personal, painful experience. Over a decade ago I wrote a string hash algorithm for a table used by the msn.com backend servers. I thought it was a reasonably randomly distributed algorithm, but I made a mistake and it was not. It turned out that all of the one hundred thousand strings that are five characters long and contain only numbers were always hashed to one of five buckets, instead of any of the six hundred or so buckets that were available. The msn.com guys were using my table to attempt to do fast lookups of tens of thousands of US postal codes, all of which are strings of five digits. Between that and a threading bug in the same code, I wrecked the performance of an important page on msn.com; this was both costly and embarrassing. Data is sometimes heavily clustered, and a good hash algorithm will take that into account.

In particular, be careful of “xor”. It is very common to combine hash codes together by xoring them, but that is not necessarily a good thing. Suppose you have a data structure that contains strings for shipping address and home address. Even if the hash algorithm on the individual strings is really good, if the two strings are frequently the same then xoring their hashes together is frequently going to produce zero. “xor” can create or exacerbate distribution problems when there is redundancy in data structures.
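A common alternative, sketched below with hypothetical fields, is a multiply-and-add combiner; because the multiplication makes each field’s position significant, identical fields no longer cancel each other out the way they do under xor. (Recent versions of .NET also provide System.HashCode.Combine for this purpose.)

class Order
{
  public string ShippingAddress;
  public string HomeAddress;

  public override int GetHashCode()
  {
    unchecked
    {
      int hash = 17;
      hash = hash * 31 + (ShippingAddress == null ? 0 : ShippingAddress.GetHashCode());
      hash = hash * 31 + (HomeAddress == null ? 0 : HomeAddress.GetHashCode());
      return hash;
    }
  }
  // (The Equals override, omitted here, must compare the same fields.)
}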

Security issue: if the hashed data can be chosen by an attacker then you might have a problem on your hands

When I wrecked that page on msn.com it was an accident that the chosen data interacted poorly with my algorithm. But suppose the page was in fact collecting data from users and storing it in a hash table for server-side analysis. If the user is hostile and can deliberately craft huge amounts of data that always hashes to the same bucket then they can mount a denial-of-service attack against the server by making the server waste a lot of time looking through an unbalanced hash table. If that’s the situation you are in, consult an expert. It is possible to build hostile-data-resistant implementations of GetHashCode but doing so correctly and safely is a job for an expert in that field.

Security issue: do not use GetHashCode “off label”

GetHashCode is designed to do only one thing: balance a hash table. Do not use it for anything else. In particular:

  • Do not use a hash code as a unique identifier for an object; as the pigeonhole argument above shows, distinct objects can and do share hash codes.
  • Do not use GetHashCode as a substitute for a cryptographic hash; it is not designed to resist an attacker deliberately manufacturing collisions.
  • Do not persist hash codes and expect them to remain stable across processes, CLR versions or appdomains; as noted above, they won’t be.

and so on.

Getting all this stuff right is surprisingly tricky.

No backtracking, part two

As I was saying last time, the nice thing about “no backtracking” is that it makes the language much easier to understand. Simple rules benefit both the compiler and the code reader; both are attempting to read the code to make sense of it. It is not always a good idea to take advantage of the compiler’s ability to search a large space if that makes it harder for a human to understand the code.

Suppose you have something like this mess: (*)

namespace XYZ.DEF
{
    public class GHI { }
}

namespace QRS.DEF.GHI
{
    public class JKL { }
}

… in another file …

using QRS; 
namespace TUV  
{
    using XYZ;
    namespace ABC
    {
        namespace DEF
        {
            class GHI { }
            class MNO : DEF.GHI.JKL { }
        }
    }
}

And now we must work out the base type of MNO.

With no backtracking we say “The nearest container of MNO that has a member DEF is ABC, therefore DEF means ABC.DEF”. Therefore GHI is ABC.DEF.GHI. Therefore JKL is ABC.DEF.GHI.JKL, which does not exist, therefore, give an error. The developer must fix the error by giving a type name that lets the compiler identify which DEF you meant.

If we had backtracking, what would we have to do? We’d get that error, and then we’d backtrack. Does XYZ contain a DEF? Yes. Does it contain a GHI? Yes. Does it contain a JKL? No. Backtrack again. Does QRS contain a DEF.GHI.JKL? Yes.

That works, but can we logically conclude from the fact that it works that it is the one the user meant?

Who the heck knows in this crazy situation? We got all kinds of good bindings in there that then went bad very late in the game. The idea that we stumbled upon the desired answer after going down so many blind alleys seems highly suspect. Maybe there is yet another choice in there that is the one the user meant. We cannot know that unless we try all of them, and again, that could involve a lot of searching.

The correct thing to do here is not to backtrack multiple times and try out all kinds of worse bindings for every stage of the lookup. The correct thing to do is to say “buddy, the best possible match for this lookup gives nonsensical results; give me something less ambiguous to work with here please.”

An unfortunate fact about writing a language where the compiler by design complains loudly if the best match is something that doesn’t work, is that developers frequently say “well, sure, in general I want the compiler to point out all my mistakes — or, rather, all my coworker’s mistakes. But for this specific case, I know what I am doing, so please, compiler, do what I mean, not what I say.”

Trouble is, you can’t have it both ways. You cannot have a compiler that both enforces rigid rules that make it highly likely that suspicious code will be aggressively identified as erroneous and allows crazy code via compiler heuristics that figure out “what I really meant” when you write something that the compiler quite rightly sees as ambiguous or wrong.

I could discuss many more places where we could do backtracking but do not. Method type inference, for example, always either makes progress or fails; it never backtracks in C# (**). But I think I will leave it at that. Except to say that there is one place where we do use backtracking, and that’s analysis of overload resolution in nested lambdas. I wrote about that here.

(*) When reading this remember that in C#, “namespace XYZ.DEF { }” is a short form for “namespace XYZ { namespace DEF { } }”

(**) Unlike in, say, F#.


Commentary from 2021:

There were a number of good comments to the original post, which pointed out:

  • Error analysis often has to perform some sort of backtracking to give a good error. In the examples I gave, for instance, we could run a backtracking analysis and then give a “did you mean…?” style error message if backtracking finds a solution, and a different error if no solution can be found. That’s a great idea in principle, but in practice we often end up in situations where backtracking is expensive, so we’d want to put a box around the amount of time and resources the error reporting system uses.
  • This is even more the case when IntelliSense is involved. IntelliSense is almost by definition operating on code that is wrong; if the code was right, why are you editing it? IntelliSense needs to give good options, but it needs to do so in under 30 milliseconds, so it uses lots of tricks to meet its time budget.
  • There was also some good discussion of the deeper language design point here. There are times when we want to use backtracking algorithms, and times we do not; what motivates that choice? There is not just one thing; like all design and implementation problems it comes down to resolving conflicts between incompatible principles. One of the nice things about “no backtracking” algorithms is that they often allow “local” analysis; if we see ABC.DEF.GHI and immediately after, ABC.DEF.XYZ, we know that no matter what, the ABC.DEF means the same thing; the meanings of ABC and ABC.DEF do not change depending on whether GHI and XYZ produce errors or not. That’s a nice property to have.

No backtracking, part one

A number of the articles I’ve published over the years involve “backtracking” algorithms; most recently my series on how to solve Sudoku puzzles (and more generally, all graph colouring problems) by backtracking. The idea of a backtracking algorithm is really simple and powerful: when faced with a choice, try every possibility. If all of them go wrong, backtrack to the previous choice and try a different possibility. Backtracking algorithms are essentially a depth-first search of the space of possible solutions.

But what if the solution space is large? There are more ways to fill in nine digits in 81 places than there are protons in the entire Universe; we can’t check them all. The reason that backtracking works so rapidly for Sudoku puzzles is because typically every guess eliminates many of the branches of the solution tree. If you guess that the first square is 1 then you do not need to explore the possibility that the second square is 1; you can just prune that branch and never explore it. Since that branch itself is vast, that’s a good strategy! (There are pathological cases for this algorithm, but in practice it is fast.)

Backtracking algorithms work well for many situations, but we have consciously eschewed them in the design of C#, with a few exceptions. For example, there is no backtracking that ever crosses “phases” of compilation in C#.(*) When we are presented with some source code the first thing we do is “lex” it; break it up into “tokens”. This is akin to finding the words in a sentence by looking at the spacing and punctuation. Then we do a grammatical analysis of the token stream, and then we do a semantic analysis of the grammatical analysis. When one of those phases finds an error, we never backtrack to a previous phase and ask for more analysis. Suppose for example you have this silly code:

x = j+++++1;

That is lexically correct C# code. The lexer is a greedy lexer; it tries to make the largest token it can at any given time. So it tokenizes this as:

x = j ++ ++ + 1 ;

Now the grammar analyzer takes that token stream and says “this is an expression involving two identifiers, one constant and four operators; how should I parenthesize that?” The grammar analyzer determines using the rules of precedence and associativity that this means

x=(((j++)++)+1);

So this means “evaluate x, then increment j, then increment whatever the previous result was, then add 1 to the previous result, then assign the previous result to x, done”

Then the semantic analyzer checks that to make sure that all those operations make sense. They do not, because the result of j++ is a value, not a variable, but the second ++ requires a variable. The semantic analyzer gives an error, and the program does not compile.

If we had a backtracking compiler, the semantic analyzer could tell the parser “no, you did something wrong, give me a different parse tree if there is one”.

Suppose the parser does that. It could say well, maybe that last + was not a binary operator but rather the unary operator + applied to 1, oh, but if we do that then we have two expressions next to each other with no intervening operator. Same thing if the second ++ applies to the +1 instead of the j++. So no, there is no other parse of this. So push back on the lexer and ask it to backtrack.

The lexer could then say, oh, sure, if you don’t like that then maybe my guess that the second ++ was ++ was wrong. Try this one:

x = j ++ + ++ 1 ;

and then the parser says OK, if that’s where the token breaks are then the grammatical analysis is:

x=((j++)+(++1));

and the semantic analyzer determines that’s wrong too, this time because you can’t have a ++ on a constant. So the semantic analyzer tells the parser, no, that’s not it either, and the parser tells the lexer, no, that’s not it either. And the lexer says “well, maybe that wasn’t a ++ at all the second time, maybe that should be

x = j ++ + + + 1 ;

And now the parser says “oh, I see what that is, it is two unary plus operators applied to 1 and an addition operator:

x=((j++)+(+(+1)));

and the semantic analyzer says “finally! something I can work with.”

That’s not what we do. This is a silly example, but perhaps you can see that there are two major problems with this strategy.

First, though most of the time lexing is unambiguous, in some pathological cases there are potentially enormously many ways that simple programs could be lexed. Just determining where the breaks could go between five plus signs gives eight possibilities; it grows exponentially from there. Since most of the time programs lex unambiguously, and when they do lex ambiguously, the space to explore could be really big, it’s better to always lex greedily and not backtrack ever. If a program fails grammatical analysis because of an unintended lexical structure then the best thing to do is tell the user and they’ll fix it.

The second major problem is that when backtracking, how do you know which backtracking is better? Of those eight possibilities, two give program fragments that pass semantic analysis: the one we just found, and x=(j+(+(+(+(+1))))); Is it at all clear which of the two possibilities is the one the user meant when they wrote this ridiculous statement? Remember, we’re not trying to solve a Sudoku puzzle here, where any given combination of numbers is as meaningless as the next; we are attempting to output IL that has the same meaning as the meaning intended by the user! One of these possibilities mutates j and the other does not, so it makes a difference. C# is not a “guess what you meant” language, it’s a “do what you said” language(**). If what you said was ambiguous, we should tell you about it.
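For the record, you can write either of those two candidate meanings unambiguously yourself; whitespace tells the greedy lexer exactly where the token breaks are:

int j = 3;
int x = j++ + + + 1;   // x = ((j++) + (+(+1))): x is 4, and j is mutated to 4
int k = 3;
int y = k + + + + + 1; // y = (k + (+(+(+(+1))))): y is 4, and k is still 3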


Next time on FAIC: we’ll return to this point by looking at how the name resolution algorithm also does not backtrack.

(*) Unlike JavaScript, interestingly enough. The ECMAScript specification notes that there are lexical ambiguities involving the / character; it could be introducing a comment as // or as /*, or closing a comment as */, or could be a normal division operator, or a /= operator, or opening or closing a regular expression. Some of those overlap; the spec actually says that the correct lex is the one that produces a grammatically sensible program!

(**) Again, unlike JavaScript.


Commentary from 2021:

There were a lot of good comments (and some needlessly pedantic comments, imagine that) on the original article that went down a few ratholes mentioned above:

  • There were some points made about the difficulty of having lexical ambiguities involving the / symbol in JS that are exacerbated by automatic semicolon insertion
  • Some readers pointed out that sudoku puzzles are required by the rules of the puzzle to have a unique solution; though that is true, I was just using it as an example of a problem amenable to a backtracking algorithm; I was not intending to draw a deeper analogy between solving sudoku puzzles and building a compiler. (Fun fact though: it is possible to encode a sudoku puzzle into a C# program such that the compiler is required to find a solution or produce an error if the puzzle is ambiguous or unsolvable.)
  • Some readers asked about scenarios that involve a kind of backtracking in the parser, but not “across phases”. For example, consider: (T)-1 — does that parse as a subtraction or as a cast of -1? Or my favourite: A(B<C,D>(E)) which parses as two arguments to A in C# 1, and one in C# 2. In both these situations the parser is required to “look ahead” arbitrarily far in the token stream to determine how the parse works; most of the time the parser only has to look ahead one token to determine what the parse should be. A “look ahead” and a “backtrack” are kinda sorta the same thing. Similarly there are scenarios where it is not clear whether >> is a shift operator or the end of a nested generic type.

In 2016 I worked on rearchitecting the Hack parser; Hack is a gradually-typed variant of PHP used inside Facebook. One of the major difficulties I faced in writing a purely functional, full-fidelity Hack parser is that Hack has XML literals in the language, much like VB. The lexical and grammatical rules of the language are completely different inside these XML regions and I did end up having to build some cross-phase backtracking into the parser, which was quite vexing.

The Truth About Value Types

As you know if you’ve read this blog for a while, I’m disturbed by the myth that “value types go on the stack”. Unfortunately, there are plenty of examples in our own documentation and in many books that reinforce this myth, either subtly or overtly. I’m opposed to it because:

  1. It is usually stated incorrectly: the statement should be “value types can be stored on the stack”, instead of the more common “value types are always stored on the stack”.
  2. It is almost always irrelevant. We’ve worked hard to make a managed environment where the distinctions between different kinds of storage are hidden from the user, unlike some languages in which you must know whether a particular storage location is on the stack or the heap for correctness reasons.
  3. It is incomplete. What about references? References are neither value types nor instances of reference types, but they are values. They’ve got to be stored somewhere. Do they go on the stack or the heap? Why does no one ever talk about them? Just because they don’t have a type in the C# type system is no reason to ignore them.

The way in the past I’ve usually pushed back on this myth is to say that the real statement should be “in the Microsoft implementation of C# on the desktop CLR, value types are stored on the stack when the value is a local variable or temporary that is not a closed-over local variable of a lambda or anonymous method, and the method body is not an iterator block, and the jitter chooses to not enregister the value.”

The sheer number of weasel words in there is astounding, but they’re all necessary:

  • Versions of C# provided by other vendors may choose other allocation strategies for their temporary variables; there is no language requirement that a data structure called “the stack” be used to store locals of value type.
  • We have many versions of the CLI that run on embedded systems, in web browsers, and so on. Some may run on exotic hardware. I have no idea what the memory allocation strategies of those versions of the CLI are. The hardware might not even have the concept of “the stack” for all I know. Or there could be multiple stacks per thread. Or everything could go on the heap.
  • Lambdas and anonymous methods hoist local variables to become heap-allocated fields; those are not on the stack anymore. (See the sketch after this list.)
  • Iterator blocks in today’s implementation of C# on the desktop CLR also hoist locals to become heap-allocated fields. They do not have to! We could have chosen to implement iterator blocks as coroutines running on a fiber with a dedicated stack. In that case, the locals of value type could go on the stack of the fiber.
  • People always seem to forget that there is more to memory management than “the stack” and “the heap”. Registers are neither on the stack nor the heap, and it is perfectly legal for a value type to go in a register if there is one of the right size. If it is important to know when something goes on the stack, then why isn’t it important to know when it goes in a register? Conversely, if the register scheduling algorithm of the jit compiler is unimportant for most users to understand, then why isn’t the stack allocation strategy also unimportant?
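To make the hoisting point concrete, here is a sketch; the closure class itself is compiler-generated, so the comments merely describe the effect:

using System;

class Example
{
  static Func<int> MakeCounter()
  {
    int count = 0;        // looks like a short-lived local of value type...
    return () => ++count; // ...but the lambda closes over it, so the compiler
                          // hoists it into a field of a heap-allocated closure
                          // class whose storage outlives this activation.
  }

  static void Main()
  {
    Func<int> next = MakeCounter();
    Console.WriteLine(next()); // 1
    Console.WriteLine(next()); // 2 -- the storage survived the method call
  }
}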

Having made these points many times in the last few years, I’ve realized that the fundamental problem is in the mistaken belief that the type system has anything whatsoever to do with the storage allocation strategy. It is simply false that the choice of whether to use the stack or the heap has anything fundamentally to do with the type of the thing being stored. The truth is: the choice of allocation mechanism has to do only with the known required lifetime of the storage.

Once you look at it that way then everything suddenly starts making much more sense. Let’s break it down into some simple declarative sentences.

  • There are three kinds of values: (1) instances of value types, (2) instances of reference types, and (3) references. (Code in C# cannot manipulate instances of reference types directly; it always does so via a reference. In unsafe code, pointer types are treated like value types for the purposes of determining the storage requirements of their values.)
  • There exist “storage locations” which can store values.
  • Every value manipulated by a program is stored in some storage location.
  • Every reference (except the null reference) refers to a storage location.
  • Every storage location has a “lifetime”. That is, a period of time in which the storage location’s contents are valid.
  • The time between a start of execution of a particular method and the method returning normally or throwing an exception is the “activation period” of that method execution.
  • Code in a method can require the use of a storage location. If the required lifetime of the storage location is longer than the activation period of the current method execution then the storage is said to be “long lived”. Otherwise it is “short lived”. (Note that when method M calls method N, the use of the storage locations for the parameters passed to N and the value returned by N is required by M.)

Now we come to implementation details. In the Microsoft implementation of C# on the CLR:

  • There are three kinds of storage locations: stack locations, heap locations, and registers.
  • Long-lived storage locations are always heap locations.
  • Short-lived storage locations are always stack locations or registers.
  • There are some situations in which it is difficult for the compiler or runtime to determine whether a particular storage location is short-lived or long-lived. In those cases, the prudent decision is to treat them as long-lived. In particular, the storage locations of instances of reference types are always treated as though they are long-lived, even if they are provably short-lived. Therefore they always go on the heap.

And now things follow very naturally:

  • We see that references and instances of value types are essentially the same thing as far as their storage is concerned; they go on the stack, into registers, or onto the heap depending on whether the storage of the value needs to be short-lived or long-lived.
  • It is frequently the case that array elements, fields of reference types, locals in an iterator block and closed-over locals of a lambda or anonymous method must live longer than the activation period of the method that first required the use of their storage. And even in the rare cases where their lifetimes are shorter than that of the activation of the method, it is difficult or impossible to write a compiler that knows that. Therefore we must be conservative: all of these storage locations go on the heap.
  • It is frequently the case that local variables and temporary values can be shown via compile-time analysis to be unused after the activation period ends, and therefore can be treated as short-lived, and therefore can go onto the stack or into registers.

Once you abandon entirely the crazy idea that the type of a value has anything whatsoever to do with the storage, it becomes much easier to reason about it. Of course, my point above stands: you don’t need to reason about it unless you are writing unsafe code or doing some sort of heavy interoperating with unmanaged code. Let the compiler and the runtime manage the lifetime of your storage locations; that’s what they’re good at.

Is is as or is as is?

Today a question about the is and as operators: is the is operator implemented as a syntactic sugar for the as operator, or is the as operator implemented as a syntactic sugar for the is operator? More briefly, is is as or is as is?

Perhaps some sample code would be more clear. This code

bool b = x is Foo;

could be considered as a syntactic sugar for

bool b = (x as Foo) != null;

in which case is is a syntactic sugar for as. Similarly,

Foo f = x as Foo;

could be considered to be a syntactic sugar for

var temp = x;
Foo f = (temp is Foo) ? (Foo)temp : (Foo)null;

in which case as is a syntactic sugar for is. Clearly we cannot have both of these be sugars because then we have an infinite regress!

The specification is clear on this point: as (in the non-dynamic case) is defined as a syntactic sugar for is.

However, in practice the CLR provides us with the instruction isinst, which ironically acts like as. We have an instruction which implements the semantics of as pretty well, from which we can build an implementation of is. In short, de jure is is is, and as is as is is, but de facto is is as and as is isinst.

I now invite you to leave the obvious jokes about President Clinton in the comments.


Commentary from 2020:

  • This article was based on a serious question I had been asked, but of course I relished the opportunity to make silly sentences such as “is is as or is as is?”
  • The bit at the end about President Clinton confused some readers, understandably. Read the explainer here for details. The TLDR is: President Clinton famously attempted to explain a lie with “it depends on what the meaning of ‘is’ is”.
  • In the original comments a reader asked why anyone should care about this level of implementation detail. My response to that question gets straight to the philosophy of why I’ve written a blog for 17 years:

I reject the premise of the question; I’m not saying that you or anyone else should care about this detail. I do not necessarily write about things that are important for developers to understand; I write about things that are fabulous! And the person who gets to decide what is fabulous is me. If some of those fabulous things happen to be important, that’s super, but that’s not my aim. The name of the blog is not “Important Stuff You Should Know About Coding“.


I blog because I’ve learned a bunch of interesting stuff — interesting to me — and I like to share things that I find interesting.

  • Another good question in the comments was “does the compiler statically analyze is and as expressions?” Yes! If you write something like if(x is string) and we already know that x is some class type Foo then the expression will always be false and you get a warning. However, there are subtle problems here. The rules of isinst and the rules of C# are slightly different. In particular, it is illegal in C# to convert int[] to uint[] via reference conversion, but legal in the CLR. You can therefore get into situations where the compiler will warn about “impossible” situations that are actually possible; see the sketch below.
  • Several readers pointed out that there is a missing language feature: once we know that x is Foo is true, it should be legal to use the members of Foo from x. A form of this feature was added in C# 7: if (x is Foo f) introduces a fresh variable f of type Foo that is usable when the test succeeds.
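Here is the int[]/uint[] oddity from the first bullet above as runnable code; on the desktop CLR this prints True even though the compiler would reject a direct conversion between the two array types:

using System;

class Program
{
  static void Main()
  {
    object o = new int[] { 1, 2, 3 };
    // C# forbids a direct int[]-to-uint[] reference conversion, but the
    // CLR permits it, so laundering the array through object succeeds:
    uint[] u = o as uint[];
    Console.WriteLine(u != null); // True
  }
}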

Don’t repeat yourself; consts are already static

Today, another entertaining question from StackOverflow. Presented again as a dialogue, as is my wont.

The specification says “even though constants are considered static members, a constant-declaration neither requires nor allows a static modifier.” Why was the decision made to not force constants to use the static modifier if they are considered static?

Let’s grant that it is sensible that constants are considered to be “static” – that is, members associated with the type itself, rather than members of a particular instance. Let’s also grant that “const” has to be in there somewhere. There are three possible choices for how to notate a constant declaration:

1) Make static optional: “const int x…” or “static const int x…” are both legal.
2) Make static required: “const int x…” is illegal, “static const int x…” is legal
3) Make static illegal: “const int x…” is legal, “static const int x…” is illegal.

Does that pretty much sum it up?

Yes. Why did the design team choose option (3) instead of (1) or (2)?

The design notes from 1999 do not say. But we can deduce what was probably going through the language designer’s heads.

The problem with (1) is that you could read code that uses both “const int x…” and “static const int y…” and then you would naturally ask yourself “what’s the difference?” Since the default for non-constant fields and methods is “instance” unless marked “static”, the natural conclusion would be that some constants are per-instance and some are per-type, and that conclusion would be wrong. This is bad because it is misleading.

The problem with (2) is that first off, it is redundant. It’s just more typing without adding clarity or expressiveness to the language. And second, I don’t know about you, but I personally hate it when the compiler gives me the error “You forgot to say the magic word right here. I know you forgot to say the magic word, I am one hundred percent capable of figuring out that the magic word needs to go there, but I’m not going to let you get any work done until you say the magic word”. I once heard that tendency of compilers compared to the guy at the party who interrupts your conversation to point out that you said “a historic occasion” when clearly you meant to say “an historic occasion“. No one likes that guy.

The problem with (3) is that the developer is required to know that const logically implies static. However, once the developer learns this fact, they’ve learned it. It’s not like this is a complex idea that is hard to figure out.

The solution which presents the fewest problems and costs to the end user is (3).
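A quick sketch of the chosen rule in action (the names here are my inventions):

class Physics
{
  // Writing "static const" here would be a compile error; static is implied.
  public const double SpeedOfLight = 299792458.0;
}

class Program
{
  static void Main()
  {
    // A constant is accessed through its type, exactly like a static field:
    System.Console.WriteLine(Physics.SpeedOfLight);
  }
}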

That seems reasonable. Does the C# language apply this principle of eschewing redundancy consistently?

Nope! It is interesting to compare and contrast this with other places in the language where different decisions were made.

For example, overloaded operators are required to be both public and static. In this case, again we are faced with three options:

(1) make public static optional,
(2) make it required, or
(3) make it illegal.

For overloaded operators we chose (2). Since the “natural state” of a method is private/instance it seems bizarre and misleading to make something that looks like a method public/static invisibly, as (1) and (3) both require.

For another example, a method with the same signature as a virtual method in a base class is supposed to have either “new” or “override” on it. Again, three choices.

(1) make it optional: you can say new, or override, or nothing at all, in which case we default to new.
(2) make it required: you have to say new or override, or
(3) make it illegal: you cannot say new at all, so if you don’t say override then it is automatically new.

In this case we chose (1) because that works best for the brittle base class situation: someone adds a virtual method to a base class, and you don’t realize that a method you already wrote now hides it. This produces a warning, but not an error.

Each of these situations has to be considered on a case-by-case basis. The general guidance is not, say, “always pick option 3”, but rather to figure out what is least misleading to the typical user and go with that.

Knights, knaves, protected and internal

When you override a virtual method in C# you are required to ensure that the stated accessibility of the overridden method – that is, whether it is public, internal, protected or protected internal(*) – is exactly re-stated in the overriding method. Except in one case. I refer you to section 10.6.4 of the specification, which states:

an override declaration cannot change the accessibility of the virtual method. However, if the overridden base method is protected internal and it is declared in a different assembly than the assembly containing the override method then the override method’s declared accessibility must be protected.

What the heck is up with that? Surely if an overridden method is protected internal then it only makes sense that the overriding method should be exactly the same: protected internal.
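Concretely, the rule plays out like this; the assembly split is the important part (class names are my inventions):

// Compiled into one assembly:
public class BaseClass
{
  protected internal virtual void M() { }
}

// Compiled into a different assembly:
public class DerivedClass : BaseClass
{
  // Declaring this "protected internal" would be a compile error here;
  // across an assembly boundary the override must say just "protected".
  protected override void M() { }
}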

I’ll explain why we have this rule, but first, a brief digression.

A certain island is inhabited by only knights and knaves. Knights make only true statements and only answer questions truthfully; knaves make only false statements and only answer questions untruthfully. If you walk up to an inhabitant of the (aptly-named) Island of Knights and Knaves you can rapidly ascertain whether a particular individual is a knight or a knave by asking a question you know the answer to. For example “does two plus two equal four?” A knight will answer “yes” (**), and a knave will answer “no”. Knaves are prone to saying things like “my mother is a male knight”, which is plainly false.

It might seem at first glance that there is no statement which could be made by both a knight and a knave. Since knights tell the truth and knaves lie, they cannot both make the same statement, right? But in fact there are many statements that can be made by both. Can you think of one?

.

.

.

.

.

.

.

Both a knight and a knave can say “I am a knight.”

How does that work? The reason this works is because the pronoun “I” refers to different people when uttered by different people. If Alice, a knight, makes the statement “I am a knight”, she is asserting the truth that “Alice is a knight”. If Bob, a knave, makes the statement “I am a knight”, he is not asserting the true statement “Alice is a knight” but rather the false statement “Bob is a knight”. Similarly, both Alice and Bob can assert the statement “My name is Alice,” for the same reason.

And that’s why overriding methods in a different assembly aren’t “protected internal”. The modifier “internal” is like a pronoun; it refers to the current assembly. When used in two different assemblies it means different things. The purpose of the rule is to ensure that a derived class does not make the accessibility domain of the virtual member any larger or smaller.

An analogy might help. Suppose a protected resource is a car, an assembly is a dwelling, a person is a class, and a descendent is a derived class.

  • Alice has a Mazda and lives in House with her good friend Charlie.
  • Charlie has a child, Diana, who lives in Apartment.
  • Alice has a child, Elroy, who lives in Condo with his good friend Greg.
  • Elroy has a child – Alice’s grandchild — Frank, who lives in Yurt.

Alice grants access to Mazda to anyone living in House and any descendent of Alice. The people who can access Mazda are Alice, Charlie, Elroy, and Frank.

Diana does not get access to Mazda because she is not Alice, not a descendent of Alice, and not a resident of House. That she is a child of Alice’s housemate is irrelevant.

Greg does not get access to Mazda for the same reason: he is not Alice, not a descendent of Alice, and not a resident of House. That he is a housemate of a descendent of Alice is irrelevant.

Now we come to the crux of the matter. Elroy is not allowed to extend his access to Mazda to Greg. Alice owns that Mazda and she said “myself, my descendents and my housemates”. Her children don’t have the right to extend the accessibility of Mazda beyond what she initially set up. Nor may Elroy deny access to Frank; as a descendent of Alice, Frank has a right to borrow the car and Elroy cannot stop him by making it “private”.

When Elroy describes what access he has to Mazda he is only allowed to say “I grant access to this to myself and my descendents” because that is what Alice already allowed. He cannot say “I grant access to Mazda to myself, my descendents and to the other residents of Condo”.


(*) Private virtual methods are illegal in C#, which irks me to no end. I would totally use that feature if we had it.

(**) Assuming that they answer at all, of course; there could be mute knights and knaves.

What’s the difference between covariance and assignment compatibility?

I’ve written a lot about this already, but I think one particular point bears repeating.

As we’re getting closer to shipping C# 4.0, I’m seeing a lot of documents, blogs, and so on, attempting to explain what “covariant” means. This is a tricky word to define in a way that is actually meaningful to people who haven’t already got degrees in category theory, but it can be done. And I think it’s important to avoid defining a word to mean something other than its actual meaning.

A number of those documents have led with something like:

“Covariance is the ability to assign an expression of a more specific type to a variable of a less specific type. For example, consider a method M that returns a Giraffe. You can assign the result of M to a variable of type Animal, because Animal is a less specific type that is compatible with Giraffe. Methods in C# are ‘covariant’ in their return types, which is why when you create a covariant interface, it is indicated with the keyword ‘out’ — the returned value comes ‘out’ of the method.”

But that’s not at all what covariance means. That’s describing “assignment compatibility” — the ability to assign a value of a more specific type to a storage of a compatible, less specific type is called “assignment compatibility” because the two types are compatible for the purposes of verifying legality of assignments.

So what does covariance mean then?

First off, we need to work out precisely what the adjective “covariant” applies to. I’m going to get more formal for a bit here but try to keep it understandable.

Let’s start by not even considering types. Let’s think about integers. (And here I am speaking of actual mathematical integers, not of the weird behaviour of 32-bit integers in unchecked contexts.) Specifically, we’re going to think about the ≤ relation on integers, the “less than or equal to” relation. (Recall that of course a “relation” is a function which takes two things and returns a bool which indicates whether the given relationship holds or does not hold.)

Now let’s think about a projection on integers. What is a projection? A projection is a function which takes a single integer and returns a new integer. So, for example, z → z + z is a projection; call it D for “double”. So are z → 0 - z, call it N for “negate”, and z → z * z, call it S for “square”.

Now, here’s an interesting question. Is it always the case that (x ≤ y) = (D(x) ≤ D(y))?  Yes, it is. If x is less than y, then twice x is less than twice y. If x is equal to y then twice x is equal to twice y. And if x is greater than y, then twice x is greater than twice y. The projection D preserves the direction of the comparison.

What about N? Is it always the case that (x ≤ y) = (N(x) ≤ N(y))?  Clearly not. 1 ≤ 2 is true, but -1 ≤ -2 is false.

But we notice that the reverse is always true!  (x ≤ y) = (N(y) ≤ N(x)). The projection N reverses the direction of the comparison.

What about S? Is it always the case that (x ≤ y) = (S(x) ≤ S(y))? No. -1 ≤ 0 is true, but S(-1) ≤ S(0) is false.

What about the opposite? Is it always the case that (x ≤ y) = (S(y) ≤ S(x)) ?

Again, no. 1 ≤ 2 is true, but S(2) ≤ S(1) is false.

The projection S does not preserve the direction of the comparison relation, and nor does it reverse it.

The projection D is “covariant” — it preserves the ordering relationship on integers. The projection N is “contravariant”. It reverses the ordering relationship on integers. The projection S does neither; it is “invariant”.

Now I hope it is more clear exactly what is covariant or contravariant. The integers themselves are not variant, and the comparison is not variant. It’s the projection that is covariant or contravariant with respect to the comparison.

So now let us abandon integers and think about reference types. Instead of the ≤ relation on integers, we have the ≤ relation on reference types. A reference type X is smaller than (or equal to) a reference type Y if a value of type X can be stored in a variable of type Y. That is, if X is “assignment compatible” with Y.

Now consider a projection from types to types. Say, the projection “T goes to IEnumerable<T>“.  That is, we have a projection that takes a type, say, Giraffe, and gives you back a new type, IEnumerable<Giraffe>. Is that projection covariant in C# 4 (with respect to assignment compatibility)?  Yes. It preserves the direction of ordering. A Giraffe may be assigned to a variable of type Animal, and a sequence of Giraffes may be assigned to a variable that can hold a sequence of Animals.

We can think of generic types as “blueprints” that produce constructed types. Let’s take the projection that takes a type T and produces IEnumerable<T> and simply call that projection “IEnumerable<T>“. We can understand from context when we say “IEnumerable<T> is covariant” what we mean is “the projection which takes a reference type T and produces a reference type IEnumerable<T> is a covariant projection on the assignment compatibility relation”. And since IEnumerable<T> only has one type parameter, it is clear from the context that we mean that the parameter to the projection is T. After all, it is a lot shorter to say “IEnumerable<T> is covariant” than that other mouthful.

So now we can define covariance, contravariance and invariance. A generic type I<T> is covariant (in T) if construction with reference type arguments preserves the direction of assignment compatibility. It is contravariant (in T) if it reverses the direction of assignment compatibility. And it is invariant if it does neither. And by that, we simply are saying in a concise way that the projection which takes a T and produces I<T> is a covariant/contravariant/invariant projection.
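In C# 4 terms, a brief sketch (Animal and Giraffe are the usual hypothetical types):

using System;
using System.Collections.Generic;

class Animal { }
class Giraffe : Animal { }

class Program
{
  static void Main()
  {
    // IEnumerable<out T> is covariant: the projection preserves
    // the direction of assignment compatibility.
    IEnumerable<Giraffe> giraffes = new List<Giraffe>();
    IEnumerable<Animal> animals = giraffes; // legal in C# 4

    // Action<in T> is contravariant: the projection reverses it.
    Action<Animal> feedAnimal = a => Console.WriteLine("fed");
    Action<Giraffe> feedGiraffe = feedAnimal; // also legal in C# 4
  }
}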


 UPDATE: My close personal friend (and fellow computer geek) Jen notes that in the Twilight series of novels, the so-called “werewolves” (who are not transformed by the full moon and therefore not actually werewolves) maintain their rigid social ordering in both wolf and human form; the projection from human to wolf is covariant in the social-ordering relation. She also notes that in high school, programming language geeks are at the bottom of the social order, but the projection to adulthood catapults them to the top of the social order, and therefore, growing up is contravariant. I am somewhat skeptical of the latter claim; the former, I’ll take your word for it. I suppose the question of how social order works amongst teenage werewolves who are computer geeks is a subject for additional research. Thanks Jen!


Original post.

Color Color

The original post of this article is archived here.

IMPORTANT NOTE FROM 2023: the C# rules regarding overload resolution of static members have changed since this article was written in 2009; it is accurate for C# 1 through 5, but this rule was changed in a more recent version; I do not offhand recall which. This article should be considered to be of historical interest; see more recent documentation for the current behaviour.


Pop quiz: What does the following code do when compiled and run?

class C
{
  public static void M(string x)
  {
    System.Console.WriteLine("static M(string)");
  }
  public void M(object s)
  {
    System.Console.WriteLine("M(object)");
  }
}
class Program
{
  static void Main()
  {
    C c = new C();
    c.M("hello");
  }
}

(1) writes static M(string)
(2) writes M(object)
(3) uh, dude, this code doesn’t even compile much less run
(4) something else

Think about that for a bit and then try it and see if you were right. Scroll down when you’re ready.

.

.

.

.

.

.

.

In option (1), the compiler could decide that the best match is the static one and let you call it through the instance, ignoring the instance value. In option (2), the compiler could decide that the object version is the best match and ignore the static one, even though the argument match is better. But neither of those actually happens; the rules of C# say that the compiler should pick the best match based on the arguments, and then disallow static calls that are through an instance! The actual result here is therefore (3):

error CS0176: Member 'C.M(string)' cannot be accessed with an instance reference; qualify it with a type name instead

What is up with this craziness? If you believe that the rule “static methods should never be accessed through instances” is a good rule – and it seems reasonable – then why doesn’t overload resolution remove the static methods from the candidate set when called through an instance? Why does it even allow the “string” version to be an applicable candidate in the first place, if the compiler knows that it can never possibly work?

I agree that this seems really goofy at first.

To explain why this is not quite as dumb as it seems, consider the following variation on the problem. Class C stays the same.

class B
{
  public C C = new C();
  public void N()
  {
    C.M("hello");
  }
}

What does a call to N on an instance of B do?

(1) writes static M(string)
(2) writes M(object)
(3) compilation error
(4) something else

Again, scroll down when you’ve decided.

.

.

.

.

.

.

A bit trickier now, isn’t it? Does C.M mean “call instance method M on the instance stored in this.C?” or does it mean “call static method M on type C”? Both are applicable!

Because of our goofy rule we do the right thing in this case. We first resolve the call based solely on the arguments and determine that the static method is the best match. Then we check to see if the “receiver” at the call site can be interpreted as being through the type. It can. So we make the static call and write “static M(string)”. If the instance version had been the best match then we would successfully call the instance method through the field.

The reason that the compiler does not remove static methods when calling through an instance is because the compiler does not necessarily know that you are calling through an instance. Because there are situations where it is ambiguous whether you’re calling through an instance or a type, we defer deciding which you meant until the best method has been selected.

Now, one could make the argument that in our first case, the receiver cannot possibly be a type. We could have further complicated the language semantics by having three overload resolution rules, one for when the receiver is known to be a type, one for when the receiver is known to not be a type, and one for when the receiver might or might not be a type. But rather than complicate the language further, we made the pragmatic choice of simply deferring the staticness check until after the bestness check. That’s not perfect but it is good enough for most cases and it solves the common problem.

The common problem is the situation that arises when you have a property of a type with the same name, and then try to call either a static method on the type, or an instance method on the property. (This can also happen with locals and fields, but local and field names typically differ in case from the names of their types.) The canonical motivating example is a public property named Color of type Color, so we call this “the Color Color problem”.
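A sketch of the canonical example, borrowing System.Drawing.Color:

using System.Drawing;

class Widget
{
  public Color Color { get; set; } // a property with the same name as its type

  public void Repaint()
  {
    // Here "Color" is resolved as the type, so the static call works:
    Color = Color.FromArgb(255, 0, 0);
    // And here "Color" is resolved as the property for instance access:
    int red = Color.R;
  }
}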

In real situations the instance and static methods typically have different names, so in the typical scenario you never end up calling an instance method when you expect to call a static method, or vice versa. The case I presented today with a name collision between a static and an instance method is relatively rare.

If you’re interested in reading the exact rules for how the compiler deals with the Color Color problem, see section 7.5.4.1 of the C# 3.0 specification.

foreach vs ForEach

A number of people have asked me why there is no Microsoft-provided ForEach sequence operator extension method. The List<T> class has such a method already of course, but there’s no reason why such a method could not be created as an extension method for all sequences. It’s practically a one-liner:

public static void ForEach<T>(
  this IEnumerable<T> sequence, 
  Action<T> action)
{
  // argument null checking omitted
  foreach(T item in sequence) 
    action(item);
}

My usual response to “why is feature X not implemented?” is that of course as Raymond Chen said “Features start out nonexistent and somebody has to make them happen.” All features are unimplemented until someone designs, implements, tests, documents and ships the feature, and no one has yet spent the money to do so in this case.

It’s not much money though. Yes, though I have famously pointed out that even small features can have large costs, this one really is dead easy, obviously correct, easy to test, and easy to document. Cost is always a factor of course, but the costs for this one really are quite small.

Of course, that cuts the other way too. If it is so cheap and easy, then you can do it yourself if you need it. And really what matters is not the low cost, but rather the net benefit. As we’ll see, I think the benefits are also very small, and therefore the net benefit might in fact be negative. But we can go a bit deeper here. I am philosophically opposed to providing such a method, for two reasons.

The first reason is that doing so violates the functional programming principles that all the other sequence operators are based upon. Clearly the sole purpose of a call to this method is to cause side effects. The purpose of an expression is to compute a value, not to cause a side effect. The purpose of a statement is to cause a side effect. The call site of this thing would look an awful lot like an expression (though, admittedly, since the method is void-returning, the expression could only be used in a “statement expression” context.) It does not sit well with me to make the one and only sequence operator that is only useful for its side effects.

The second reason is that doing so adds zero new representational power to the language. Doing this lets you rewrite this perfectly clear code:

foreach(Foo foo in foos){ statement involving foo; }

into this code:

foos.ForEach((Foo foo)=>{ statement involving foo; });

which uses almost exactly the same characters in slightly different order. And yet the second version is harder to understand, harder to debug, and introduces closure semantics, thereby potentially changing object lifetimes in subtle ways.

When we provide two subtly different ways to do exactly the same thing, we produce confusion in the industry, we make it harder for people to read each other’s code, and so on. Sometimes the benefit added by having two different textual representations for one operation (like query comprehensions versus their underlying method call form, or + versus String.Concat) is so huge that it’s worth the potential confusion. But the compelling benefit of query expressions is their readability; this new form of foreach is certainly no more readable than the normal form and is arguably worse.

If you don’t agree with these philosophical objections and find practical value in this pattern, by all means, go ahead and write this trivial one-liner yourself.


Notes from 2020

This episode was, oddly enough, one of the more controversial things I’ve posted on this blog; it sparked a lot of comments and a lot of inbound links from people arguing about the points raised here. Some of the points made in the comments to the original were:

  • The ForEach method does exist on lists, as I noted. Is there anything different about lists than sequences which makes it better on lists? My answer was that I found it still distasteful on lists, but with lists at least we can make the argument that lists are intended to be a mutable data structure, so maybe it makes more sense to have a method designed to have a side effect operating on a mutable data structure? But that’s a pretty weak argument because of course you’re still not supposed to mutate a list while you are iterating it!
  • A commenter pointed out that the list implementation of the ForEach method iterates by index in a for loop; ironically it does not use a foreach.
  • A ForEach extension method is allegedly more readable at the end of a long chain of sequence operators than a foreach statement is at the top.
  • A ForEach extension method that introduces different semantics than regular foreach would have a clear benefit — like PLINQ’s AsParallel().ForAll() does. Good point, though of course that is a sharp tool.
  • You could make the extension method non-void by both causing a side effect on each element and then yielding it. There are usages for such a thing; in particular I have used that method for “dump this sequence to the debugger display”. It’s nice to be able to insert a debugging hook like that into the middle of a long chain of sequence operators that is going wrong. But I would still feel icky about putting it into production code.
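That last idea, sketched as a hypothetical “Do” operator (the name is my invention; argument validation is omitted as before):

public static IEnumerable<T> Do<T>(
  this IEnumerable<T> sequence,
  Action<T> action)
{
  foreach (T item in sequence)
  {
    action(item);      // cause the side effect...
    yield return item; // ...then pass the item along unchanged
  }
}

One could then drop something like .Do(x => Console.WriteLine(x)) into the middle of a long chain of sequence operators to watch the elements stream past, while keeping the query itself lazy.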