Guidelines and rules for GetHashCode

The code is more what you’d call guidelines than actual rules” – truer words were never spoken. It’s important when writing code to understand what are vague “guidelines” that should be followed but can be broken or fudged, and what are crisp “rules” that have serious negative consequences for correctness and robustness. I often get questions about the rules and guidelines for GetHashCode, so I thought I might summarize them here.

What is GetHashCode used for?

It is by design useful for only one thing: putting an object in a hash table. Hence the name.

Why do we have this method on Object in the first place?

It makes perfect sense that every object in the type system should provide a GetType method; data’s ability to describe itself is a key feature of the CLR type system. And it makes sense that every object should have a ToString, so that it is able to print out a representation of itself as a string, for debugging purposes. It seems plausible that objects should be able to compare themselves to other objects for equality. But why should it be the case that every object should be able to hash itself for insertion into a hash table? Seems like an odd thing to require every object to be able to do.

I think if we were redesigning the type system from scratch today, hashing might be done differently, perhaps with an IHashable interface. But when the CLR type system was designed there were no generic types and therefore a general-purpose hash table needed to be able to store any object.

How do hash tables and similar data structures use GetHashCode?

Consider a “set” abstract data type. Though there are many operations you might want to perform on a set, the two basic ones are insert a new item into the set, and check to see whether a given item is in the set. We would like these operations to be fast even if the set is large. Consider, for example, using a list as an implementation detail of a set:

class Set
{
  private List list = new List();
  public void Insert(T item)
  {
    if (!Contains(t))
      list.Add(item);
  }
  public bool Contains(T item)
  {
    foreach(T member in list)
      if (member.Equals(item))
        return true;
    return false;
  }
}

(I’ve omitted any error checking throughout this article; we probably want to make sure that the item is not null. We probably want to implement some interfaces, and so on. I’m keeping things simple so that we concentrate on the hashing part.)

The test for containment here is linear; if there are ten thousand items in the list then we have to look at all ten thousand of them to determine that the object is not in the list. This does not scale well.

The trick is to trade a small amount of increased memory burden for a huge amount of increased speed. The idea is to make many shorter lists, called “buckets”, and then be clever about quickly working out which bucket we’re looking at:

class Set
{
  private List[] buckets = new List[100];
  public void Insert(T item)
  {
    int bucket = GetBucket(item.GetHashCode());
    if (Contains(item, bucket))
      return;
    if (buckets[bucket] == null)
    buckets[bucket] = new List();
    buckets[bucket].Add(item);
  } 
  public bool Contains(T item)
  {
    return Contains(item, GetBucket(item.GetHashCode());
  }
  private int GetBucket(int hashcode)
  {
    unchecked
    {
      // A hash code can be negative, and thus its remainder can be negative also.
      // Do the math in unsigned ints to be sure we stay positive.
      return (int)((uint)hashcode % (uint)buckets.Length);
    }
  }
  private bool Contains(T item, int bucket)
  {
    if (buckets[bucket] != null)
    foreach(T member in buckets[bucket])
      if (member.Equals(item))
        return true;
    return false;
  }
}

Now if we have ten thousand items in the set then we are looking through one of a hundred buckets, each with on average a hundred items; the Contains operation just got a hundred times cheaper.

On average.

We hope.

We could be even more clever here; just as a List resizes itself when it gets full, the bucket set could resize itself as well, to ensure that the average bucket length stays low. Also, for technical reasons it is often a good idea to make the bucket set length a prime number, rather than 100. There are plenty of improvements we could make to this hash table. But this quick sketch of a naive implementation of a hash table will do for now. I want to keep it simple.

By starting with the position that this code should work, we can deduce what the rules and guidelines must be for GetHashCode:

Rule: equal items have equal hashes

If two objects are equal then they must have the same hash code; or, equivalently, if two objects have different hash codes then they must be unequal.

The reasoning here is straightforward. Suppose two objects were equal but had different hash codes. If you put the first object in the set then it might be put into bucket #12. If you then ask the set whether the second object is a member, it might search bucket #67, and not find it.

Note that it is not a rule that if two objects have the same hash code, then they must be equal. There are only four billion or so possible hash codes, but obviously there are more than four billion possible objects. There are far more than four billion ten-character strings alone. Therefore there must be at least two unequal objects that share the same hash code, by the Pigeonhole Principle: if you have four billion pigeon holes to hold more than four billion pigeons then at least one pigeonhole has two pigeons.

Guideline: the integer returned by GetHashCode should never change

Ideally, the hash code of a mutable object should be computed from only fields which cannot mutate, and therefore the hash value of an object is the same for its entire lifetime.

However, this is only an ideal-situation guideline; the actual rule is:

Rule: the integer returned by GetHashCode must never change while the object is contained in a data structure that depends on the hash code remaining stable

It is permissible, though dangerous, to make an object whose hash code value can mutate as the fields of the object mutate. If you have such an object and you put it in a hash table then the code which mutates the object and the code which maintains the hash table are required to have some agreed-upon protocol that ensures that the object is not mutated while it is in the hash table. What that protocol looks like is up to you.

If an object’s hash code can mutate while it is in the hash table then clearly the Contains method stops working. You put the object in bucket #5, you mutate it, and when you ask the set whether it contains the mutated object, it looks in bucket #74 and doesn’t find it.

Remember, objects can be put into hash tables in ways that you didn’t expect. A lot of the LINQ sequence operators use hash tables internally. Don’t go dangerously mutating objects while enumerating a LINQ query that returns them!

Rule: Consumers of GetHashCode cannot rely upon it being stable over time or across appdomains

Suppose you have a Customer object that has a bunch of fields like Name, Address, and so on. If you make two such objects with exactly the same data in two different processes, they do not have to return the same hash code. If you make such an object on Tuesday in one process, shut it down, and run the program again on Wednesday, the hash codes can be different.

This has bitten people in the past. The documentation for System.String.GetHashCode notes specifically that two identical strings can have different hash codes in different versions of the CLR, and in fact they do. Don’t store string hashes in databases and expect them to be the same forever, because they won’t be.

Rule: GetHashCode must never throw an exception, and must return

Getting a hash code simply calculates an integer; there’s no reason why it should ever fail. An implementation of GetHashCode should be able to handle any legal configuration of the object.

I occasionally get the response “but I want to throw NotImplementedException in my GetHashCode to ensure that the object is never put into a hash table; I don’t intend for this object to ever be put into a hash table.”  Well, OK, but the last sentence of the previous guideline applies; this means that your object cannot be a result in many LINQ-to-objects queries that use hash tables internally for performance reasons.

Since it doesn’t throw an exception, it has to return a value eventually. It’s not legal or smart to make an implementation of GetHashCode that goes into an infinite loop.

This is particularly important when hashing objects that might be recursively defined and contain circular references. If hashing object Alpha hashes the value of property Beta, and hashing Beta turns right around and hashes Alpha, then you’re going to either loop forever (if you’re on an architecture that can optimize tail calls) or run out of stack and crash the process.

Guideline: the implementation of GetHashCode must be extremely fast

The whole point of GetHashCode is to optimize a lookup operation; if the call to GetHashCode is slower than looking through those ten thousand items one at a time, then you haven’t made a performance gain.

I classify this as a “guideline” and not a “rule” because it is so vague. How slow is too slow? That’s up to you to decide.

Guideline: the implementation of GetHashCode must be performant

Speed is only one kind of performance; consider a hash function that allocates a bunch of heap memory as it is computing a hash code. (For instance, by concatenating strings and hashing the resulting string, rather than hashing the strings separately and combining the results.) That generates collection pressure, which makes collections more likely, which maybe then slows down some crucial code that is going to run later.

Guideline: the distribution of hash codes must be “random”

By a “random distribution” I mean that if there are commonalities in the objects being hashed, there should not be similar commonalities in the hash codes produced. Suppose for example you are hashing an object that represents the latitude and longitude of a point. A set of such locations is highly likely to be “clustered”; odds are good that your set of locations is, say, mostly houses in the same city, or mostly valves in the same oil field, or whatever. If clustered data produces clustered hash values then that might decrease the number of buckets used and cause a performance problem when the bucket gets really big.

Again, I list this as a guideline rather than a rule because it is somewhat vague, not because it is unimportant. It’s very important. But since good distribution and good speed can be opposites, it’s important to find a good balance between the two.

I know this from deep, personal, painful experience. Over a decade ago I wrote a string hash algorithm for a table used by the msn.com backend servers. I thought it was a reasonably randomly distributed algorithm, but I made a mistake and it was not. It turned out that all of the one hundred thousand strings that are five characters long and contain only numbers were always hashed to one of five buckets, instead of any of the six hundred or so buckets that were available. The msn.com guys were using my table to attempt to do fast lookups of tens of thousands of US postal codes, all of which are strings of five digits. Between that and a threading bug in the same code, I wrecked the performance of an important page on msn.com; this was both costly and embarrassing. Data is sometimes heavily clustered, and a good hash algorithm will take that into account.

In particular, be careful of “xor”. It is very common to combine hash codes together by xoring them, but that is not necessarily a good thing. Suppose you have a data structure that contains strings for shipping address and home address. Even if the hash algorithm on the individual strings is really good, if the two strings are frequently the same then xoring their hashes together is frequently going to produce zero. “xor” can create or exacerbate distribution problems when there is redundancy in data structures.

Security issue: if the hashed data can be chosen by an attacker then you might have a problem on your hands

When I wrecked that page on msn.com it was an accident that the chosen data interacted poorly with my algorithm. But suppose the page was in fact collecting data from users and storing it in a hash table for server-side analysis. If the user is hostile and can deliberately craft huge amounts of data that always hashes to the same bucket then they can mount a denial-of-service attack against the server by making the server waste a lot of time looking through an unbalanced hash table. If that’s the situation you are in, consult an expert. It is possible to build hostile-data-resistant implementations of GetHashCode but doing so correctly and safely is a job for an expert in that field.

Security issue: do not use GetHashCode “off label”

GetHashCode is designed to do only one thing: balance a hash table. Do not use it for anything else. In particular:

and so on.

Getting all this stuff right is surprisingly tricky.

Never say never, part two

This is part two of a two-part series about determining whether the endpoint of a method is never reachable. Part one is here. A follow-up article is here.


Whether we have a “never” return type or not, we need to be able to determine when the end point of a method is unreachable for error reporting in methods that have non-void return type. The compiler is pretty clever about working that out; it can handle situations like

int M()
{
  try
  {
    while(true) N();
  }
  catch(Exception ex)
  {
    throw new WrappingException(ex);
  }
}

The compiler knows that N either throws or it doesn’t, and that if it doesn’t, then the try block never exits, and if it does, then either the construction of the exception throws, or the construction succeeds and the catch throws the new exception. No matter what, the end point of M is never reached.

However, the compiler is not infinitely clever. It is easy to fool it:
Continue reading

Looking inside a double

Occasionally when I’m debugging the compiler or responding to a user question I’ll need to quickly take apart the bits of a double-precision floating point number. Doing so is a bit of a pain, so I’ve whipped up some quick code that takes a double and tells you all the salient facts about it. I present it here, should you have any use for it yourself.[1. Note that this code was built for comfort, not speed; it is more than fast enough for my purposes so I’ve spent zero time optimizing it.]

Continue reading

What would Feynman do?

No one I know at Microsoft asks those godawful “lateral-thinking puzzle” interview questions anymore. Maybe someone still does, I don’t know. But rumour has it that a lot of companies are still following the Microsoft lead from the 1990s in their interviews. In that tradition, I present a sequel to Keith Michaels’ 2003 exercise in counterfactual reasoning. Once more, we dare to ask the question “how well would the late Nobel-Prize-winning physicist Dr. Richard P. Feynman do in a technical interview at a software company?

Continue reading

Strange but legal

“Can a property or method really be marked as both abstract and override?” one of my coworkers just asked me. My initial gut response was “of course not!” but as it turns out, the Roslyn codebase itself has a property getter marked as both abstract and override. (Which is why they were asking in the first place.)

I thought about it a bit more and reconsidered. This pattern is quite rare, but it is perfectly legal and even sensible. The way it came about in our codebase is that we have a large, very complex type hierarchy used to represent many different concepts in the compiler. Let’s call it “Thingy”:

abstract class Thingy 
{
  public virtual string Name { get { return ""; } } 
  ...

There are going to be a lot of subtypes of Thingy, and almost all of them will have an empty string for their name. Or null, or whatever; the point is not what exactly the value is, but rather that there is a sensible default name for almost everything in this enormous type hierarchy.

However, there is another abstract kind of Thingy, a FrobThingy, which always has a non-empty name. In order to prevent derived classes of FrobThingy from accidentally using the default implementation from the base class, we said:

abstract class FrobThingy : Thingy 
{   
  public abstract override string Name { get; } }
  ...

Now if you make a derived class BigFrobThingy, you know that you have to provide an implementation of Name for it because it will not compile if you don’t.

Curiouser and curiouser

Here’s a pattern you see all the time in C#:

class Frob : IComparable<Frob>

At first glance you might ask yourself why this is not a “circular” definition; after all, you’re not allowed to say class Frob : Frob(*). However, upon deeper reflection that makes perfect sense; a Frob is something that can be compared to another Frob. There’s not actually a real circularity there.

This pattern can be genericized further:

class SortedList<T> where T : IComparable<T>

Again, it might seem a bit circular to say that T is constrained to something that is in terms of T, but actually this is just the same as before. T is constrained to be something that can be compared to T. Frob is a legal type argument for a SortedList because one Frob can be compared to another Frob.

But this really hurts my brain:

class Blah<T> where T : Blah<T>

That appears to be circular in (at least) two ways. Is this really legal?

Yes it is legal, and it does have some legitimate uses. I see this pattern rather a lot(**). However, I personally don’t like it and I discourage its use.

This is a C# variation on what’s called the Curiously Recurring Template Pattern in C++, and I will leave it to my betters to explain its uses in that language. Essentially the pattern in C# is an attempt to enforce the usage of the CRTP.

So, why would you want to do that, and why do I object to it?

One reason why people want to do this is to enforce a particular constraint in a type hierarchy. Suppose we have

abstract class Animal
{
  public virtual void MakeFriends(Animal animal);
}

But that means that a Cat can make friends with a Dog, and that would be a crisis of Biblical proportions! (***) What we want to say is

abstract class Animal
{
  public virtual void MakeFriends(THISTYPE animal);
}

so that when Cat overrides MakeFriends, it can only override it with Cat.

Now, that immediately presents a problem in that we’ve just violated the Liskov Substitution Principle. We can no longer call a method on a variable of the abstract base type and have any confidence that type safety is maintained. Variance on formal parameter types has to be contravariance, not covariance, for it to be typesafe. And moreover, we simply don’t have that feature in the CLR type system.

But you can get close with the curious pattern:

abstract class Animal<T> where T : Animal<T>
{
  public virtual void MakeFriends(T animal);
}
class Cat : Animal<Cat>
{
  public override void MakeFriends(Cat cat) {}
}

and hey, we haven’t violated the LSP and we have guaranteed that a Cat can only make friends with a Cat. Beautiful.

Wait a minute… did we really guarantee that?

class EvilDog : Animal<Cat>
{
  public override void MakeFriends(Cat cat) { }
}

We have not guaranteed that a Cat can only make friends with a Cat; an EvilDog can make friends with a Cat too. The constraint only enforces that the type argument to Animal be good; how you use the resulting valid type is entirely up to you. You can use it for a base type of something else if you wish.

So that’s one good reason to avoid this pattern: because it doesn’t actually enforce the constraint you think it does. Everyone has to play along and agree that they’ll use the curiously recurring pattern the way it was intended to be used, rather than the evil dog way that it can be used.

The second reason to avoid this is simply because it bakes the noodle of anyone who reads the code. When I see List I have a very clear idea of what the relationship is between the List part — it means that there are going to be operations that add and remove things — and the Giraffe part — those operations are going to be on giraffes. When I see FuturesContract where T : LegalPolicy I understand that this type is intended to model a legal contract about a transaction in the future which has some particular controlling legal policy. But when I read Blah where T : Blah I have no intuitive idea of what the intended relationship is between Blah and any particular TIt seems like an abuse of a mechanism rather than the modeling of a concept from the program’s “business domain”.

All that said, in practice there are times when using this pattern really does pragmatically solve problems in ways that are hard to model otherwise in C#; it allows you to do a bit of an end-run around the fact that we don’t have covariant return types on virtual methods, and other shortcomings of the type system. That it does so in a manner that does not, strictly speaking, enforce every constraint you might like is unfortunate, but in realistic code, usually not a problem that prevents shipping the product.

My advice is to think very hard before you implement this sort of curious pattern in C#; do the benefits to the customer really outweigh the costs associated with the mental burden you’re placing on the code maintainers?


(*) Due to an unintentional omission, some past editions of the C# specification actually did not say that this was illegal! However, the compiler has always enforced it. In fact, the compiler has over-enforced it, sometimes accidentally catching non-cycles and marking them as cycles.

(**) Most frequently in emails asking “is this really legal?”

(***) Mass hysteria!