Math from scratch, part six: comparisons

Getting equality correct is surprisingly tricky in C#. The landscape is a bit of a confusing jumble:

  • The == and != operators are a syntactic sugar for static method calls, so it is dispatched based on the static (compile-time) type of both operands.
  • object.Equals(object) is a virtual method, so of course it dispatches on the runtime type of its receiver but not its argument. An override on a type T is required to handle any object as an argument, not just instances of T.
  • IEquatable<T>.Equals(T) by contrast takes a T.
  • For reference types, you’ve got to consider comparisons to null even if it is an invalid value.

That’s four ways to implement equality already and we haven’t even gotten into IComparable<T>.CompareTo(T), which also needs to be able to compute equality.

Moreover: it is legal and surprisingly common for a class to implement == and Equals inconsistently. The Framework Design Guidelines, in section 8.10 actually verges on self-contradiction: it says both that Equals and == should have exactly the same semantics, and then hedges later on the same page to say that for reference types it can be surprising if == means value instead of reference equality. I occasionally see codebases in which == means reference equality and Equals means value equality;[1. Though the reverse seems rare.] it is then very easy to cause a bug by using reference equality when you mean value equality.

So that’s the equality side of comparison; now let’s talk a bit about IComparable<T> and the inequality operators. Before we go on, if you want a quick refresher on how comparers work, check out my series of articles on buggy comparers.

OK, welcome back. Summing up:

  • There are nine ways to do a comparison in C#: < <= > >= == != object.Equals(object) IEquatable<T>.Equals(T) IComparable<T>.CompareTo(T)
  • Ideally these should all be consistent with each other. That is, if x == y is true then x < y is false but x <= y and x.Equals(y) are true and x.CompareTo(y) is zero, and so on.
  • All these operators should consider null as a valid value which is equal to itself and smaller than any other value. This way you can have a List<Natural> and sort it successfully even if there are nulls in there.
  • The CompareTo method returns a negative integer if the receiver is smaller than the argument, a positive integer if it is larger, and zero if they are equal. So I’m going to be stuck having to use int in my program after all, but I will only use three integers, -1, 0 and 1.
  • The comparison operators should produce a total order. Since the naturals obviously can be put into a consistent order, this will not be difficult.

The easiest way to get all this right in my opinion is to write a single static helper method that does the computation, and then nine entrypoints that simply call the helper method. I’ve made the class implement IEquatable<Natural> and IComparable<Natural>.  Here are my entrypoints:

public int CompareTo(Natural x) { return CompareTo(this, x); }
public static bool operator <(Natural x, Natural y) { return CompareTo(x, y) < 0; }
public static bool operator >(Natural x, Natural y) { return CompareTo(x, y) > 0; }
public static bool operator <=(Natural x, Natural y) { return CompareTo(x, y) <= 0; }
public static bool operator >=(Natural x, Natural y) { return CompareTo(x, y) >= 0; }
public static bool operator ==(Natural x, Natural y) { return CompareTo(x, y) == 0; }
public static bool operator !=(Natural x, Natural y) { return CompareTo(x, y) != 0; } 
public override bool Equals(object obj) { return CompareTo(this, obj as Natural) == 0; }
public bool Equals(Natural x) { return CompareTo(this, x) == 0; }

Notice that this gives you a nice mnemonic for remembering whether a negative number means that the receiver is smaller than the argument or vice versa: x OP y is equivalent to x.CompareTo(y) OP 0.

I don’t think I need to spell out all the base and recursive cases here; inequality is pretty straightforward.

// negative means x < y 
// positive means x > y 
// zero means x == y 
// two nulls are equal 
// otherwise, null is always smaller 
private static int CompareTo(Natural x, Natural y) { 
    if (ReferenceEquals(x, y)) 
        return 0; 
    else if (ReferenceEquals(x, null)) 
        return -1; 
    else if (ReferenceEquals(y, null)) 
        return 1; 
    else if (ReferenceEquals(x, Zero)) 
        return -1; 
    else if (ReferenceEquals(y, Zero)) 
        return 1; 
    else if (x.head == y.head) 
        return CompareTo(x.tail, y.tail); 
    else if (x.head == ZeroBit) 
        return CompareTo(x.tail, y.tail) > 0 ? 1 : -1; 
    else 
        return CompareTo(x.tail, y.tail) < 0 ? -1 : 1; 
}

C# likes it if you override GetHashCode when you override Equals. As I’ve discussed at length, it is important that any two objects that compare as equal have the same hash code. Since a hash code is a 32 bit integer, this is again of the few places in the series in which I’m forced to use a built-in integer type. Vexing, but I can deal.

It seems reasonable to use the value of the natural, truncated to 32 bits and subject to overflow, as the hash code, and it might come in handy for debugging purposes as well, so let’s make a Natural-to-integer converter:

public override int GetHashCode()
{
    return ToInteger();
}
private int ToInteger()
{
    if (ReferenceEquals(this, Zero))
        return 0;
    else
        return (this.head == ZeroBit ? 0 : 1) + 2 * this.tail.ToInteger();
}

Holy goodness, we got ten public entrypoints into Natural implemented in this episode. Next time on FAIC we’ll slow down a bit and consider just two as we discuss the division and remainder operators in depth.

About these ads

39 thoughts on “Math from scratch, part six: comparisons

  1. I occasionally see codebases in which == means reference equality and Equals means value equality

    Makes life simpler for Java immigrants.

    • Yeah, to me this is good and natural and how things should be. Otherwise, why have both an Equals() method and an operator?

      However, if you take the perspective that reference or value is an implementation detail that should not be exposed to the users of your class, or a maths class using maths operators just makes sense, then Eric’s approach makes sense too.

        • If S1 and S2 are variables of type String which both hold reference to 5,000 character strings, they can have three relationships:

          - They may identify the same instance of System.String

          - They may identify distinct instances of System.String which encapsulate the same sequence of characters

          - They may identify distinct instances of string which encapsulate different sequences of characters

          The first case can be determined quickly using ReferenceEquals. The third may require examining up to 5,000 characters (but will often require much fewer). The second case is the worst, as it will always require examining all 5,000 characters. While it isn’t strictly necessary to distinguish between the first two cases, many programs can benefit from trying to exploit the best-case behavior whenever possible.

          Further, types which encapsulate mutable entities generally have no meaningful concept of equality other than their identity. Code which works with mutable entities could use the virtual Object.Equals to test them for common identity, but testing for reference identity directly is faster than the virtual method dispatch that would be required to call Object.Equals(Object) and have it test for reference identity.

        • No, in the Java way == equals ReferenceEquals. Equals equals value equality. Java doesn’t have a ReferenceEquals because it would be redundant.

          I expect in C# ReferenceEquals was added because you can overload == to mean something different from reference equality. If you want to be sure you’re using reference equality regardless of how someone wrote their class, ReferenceEquals is the one you want.

  2. I occasionally see codebases in which == means reference equality and Equals means value equality

    I imagine part of that might be because for some (inexperienced) developers == “feels” like it belongs to the language whereas Equals() “feels” like it belongs to the object. However, there’s also MSDN which says ( http://msdn.microsoft.com/en-us/library/bsc2ak47.aspx ):

    The following guidelines apply to overriding Equals(Object) for a reference type:

    > Consider overriding Equals if the semantics of the type are based on the fact that the type represents some value(s).

    > Most reference types must not overload the equality operator, even if they override Equals. However, if you are implementing a reference type that is intended to have value semantics, such as a complex number type, you must override the equality operator.

    > You should not override Equals on a mutable reference type. This is because overriding Equals requires that you also override the GetHashCode method, as discussed in the previous section. This means that the hash code of an instance of a mutable reference type can change during its lifetime, which can cause the object to be lost in a hash table.

    A quick skim of that section could lead someone to believe that you should not implement ==, not implement Equals(), or do some combination of the two. That said, poor reading of the documentation is hardly a good excuse :)

    • My reading is that the guidelines support the idea that “== means reference equality and Equals means value equality”, plus a bit tacked on about how to not screw up hashtables. It looks to me like this notion is a carry-over from Java, which is all well and good.

      If you abandon it, then yes, the difference between the operator and the method seems arbitrary and pointlessly complex. I think that it just doesn’t gell well with the == operator overriding feature.

      If you can override ==, you don’t need or want Equals(), really.

    • The behavior of the equality operator in C# (i.e. ==) is such that no class type can overload it to implement any equivalence relation (other than the default reference identity) among all instances of that type. Assume F1 and F2 are variables of type Foo that identify distinct instances which may or may not be equivalent. Assume further that O1 and O2 are variables of type Object that are initialized to F1 and F2, respectively. No matter what overloads the type supports, O1==O2 will be false; any reasonable overloads should make F1==O1 and F2==O2 true (overloads for (Foo,Object) and (Object,Foo) could unconditionally returned false, but that would be very weird). The above relationships mean that that for == to behave as an equivalence relation, F1==O2, F2==O1, and F1==F2 must all return false regardless of whether F1 and F2 are “equivalent”.

      It might be possible for Foo to implement an equivalence relation in all cases which compile by defining overloads for (Foo,Object) and (Object,Foo) both of which are marked with ‘[Obsolete(True,"Cast both operands to Foo for value equality or Object for reference equality");]‘, but I’ve never seen such “Obsolete” overloads used.

      Since == cannot in C# implement any equivalence relation among class types other than reference equivalence, that is a strong argument for letting it mean precisely that. Because Equals is virtual rather than statically dispatched, it can implement equivalence relations among types where == cannot.

        • Since you appear to agree, but have overridden == anyway, I infer that you feel either that (i) casting Naturals to Object sucks, or (ii) equivalence relations suck, at least when you do.

          On a related note, I feel that Equals should always return false if the run-time types of the two objects differ. Some MSDN examples disagree, leading to further equivalence relation problems if a parent and a child class both implement Equals differently.

          • There are two type-related issues with `Equals`–one mentioned earlier is that there’s no way to tell a compiler not to apply type conversions to a type to satisfy a type-specific overload of `Equals`. The other is that there’s no means by which types that encapsulate values can avoid exposing certain details of their implementation.

            Consider, for example, an abstract ImmutableDoubleMatrix type, which has property getters for `int Width`, `int Height`, and `double this[int,int]`. One plausible implementation would be ImmutableArrayBackedDoubleMatrix, which would hold a `double [,]` and return its width, height, and contents. Another would be ImmutableIdentityDoubleMatrix which would simply hold a `int size` which it would report for `Width` and `Height`; `this[int r,int c]` property would return `r==c ? 1.0 : 0.0`. An ImmutableArrayBackedDoubleMatrixcontaining {{1.0,0.0},{0.0,1.0}} should semantically indistinguishable from an ImmutableIdentityDoubleMatrix of size 2. If the normal “GetType()` method could return “ImmutableDoubleMatrix” for both, they could be made indistinguishable. Unfortunately, the only way to make the illusion work is to have a proxy wrapper type which holds a reference to an ImmutableDoubleMatrix and conceals the actual thing to which it holds a reference. Performance could be mostly good if the proxy was a single-item struct, but unfortunately boxing would still result in an extra level of indirection.

  3. “Next time on FAIC we’ll slow down a bit…”

    Doing computations on these bits is pretty slow already! Fortunately you’re not worried about efficient code here, so you can slow down as many bits as you like.

  4. I write a lot of immutable objects, and it’s so much simpler to have .Equals() and == both measure value equality. String == is probably my #1 screw-up in java. If you really want to check reference equality, there’s a ReferenceEquals method that makes the intent of the code absolutely clear.

    • In Java, modern IDEs should be able to pick up on that and warn you.

      A few years ago, I wrote a String == bug. The Sun JVM was so good at interning strings, it never manifested. It was only when we tried to get the code running with another JVM that it ever showed.

      At that point my colleagues convinced me to turn that warning on. :)

  5. The .NET framework has relatively few contractual requirements for subclasses of Object, but one of them is that the `Equals(Object)` override should implement an equivalence relation. Unfortunately, the existence of implicit upcasts and type conversions makes it almost impossible for `x==y` to be an equivalence relation. If languages could specify cases in which implicit upcasts or type conversions should not be considered (e.g. the operands to certain `==` overloads) and went against the (IMHO misguided) IEEE-754 directive that NaN != NaN, then it would be possible to an equality operator which behaved like other means of equality testing in all cases where it compiled. Given the overloading behaviors that exist, though, the only way to avoid “astonishing” cases where `==` and `Equals` behave differently is to *expect* them to behave differently.

    As for why codebases use `==` to mean referentail equality and `Equals` for value equality, that’s easy: if a type *doesn’t* satisfy any overloads of `==` other than (object,object) then that operator will naturally behave as an equivalence relation when applied to instances of that type. If a type does overload `==`, making it behave as an equivalence relation is much harder.

    BTW, a bigger question in my book is why the creators of C# decided to use for its reference-comparison operator the same token as is used for the overloadable equality operator. In VB.NET, given some variables of type `String` and some of type `Object`, one may use the `=` operator only when comparing the variables of type String, and it will implement a value-based equivalence relation among those. One may use the `Is` operator with all class-type variables, and it will implement a reference-based equivalence relation. Trying to use `=` to compare a `String` and an `Object`, or even two variables of type `Object` won’t perform a reference comparison–it will simply refuse to compile.

    • Agreed on all points, except that NaN != NaN is great. NaN is not a particular value, it’s an absence of a value. 1.0/0.0 is NaN, and -1.0/0.0 is NaN, but 1.0/0.0 is not -1.0/0.0, in IEEE maths as well as real maths.

      It just makes for some confusing bugs when one forgets that.

      • The main problem with NaN is that, well, it is sort-of-the bottom of the domain of IEEE float-point numbers. It is less defined than any “normal” value. And if we require the comparison to be monotonic (which we actually HAVE to do), the conclusion is that “NaN == NaN” is neither True nor False, it is the bottom of the Boolean domain.

        Well, introducing an explicit bottom value is… unwise. You basically give yourself a toy divergency to be aware of.

        • What do you mean by “monotonic”, and in what sense is that more important than having at least one of the equality operators define an equivalence relation? In what sense should having (0.0/0.0+1.0)==(0.0/0.0+2.0) be any more “surprising” than (1.0E38+1.0==1.0E38+2.0)? Floating-point equality tests on computed values very seldom means that the actual numerical results of the computations were equal; rather, it means that the floating-point types involved were unable to recognize any difference between them.

          • @Joker_vD: The blog post to which you link suggests that comparisons involving NaN should perhaps yield a “Not A Boolean” if such a thing existed, but the author seems to conclude, just before “Final Observations”, that he is unaware of any argument for regarding the “NaB” as false.

            Your point about monotonicity is interesting in other ways, though, though, since I would regard a “double” as being more specific than a “float”, and regard implicit float->double observations as far more dubious than double->float (just about the only time implicit double->float comparisons should be considered dubious would be when the values are arguments to a comparison operator; if I had my druthers, direct float-double comparisons would be *forbidden* even though most other implicit conversions would be allowed.

        • Well, introducing an explicit bottom value is… unwise. You basically give yourself a toy divergency to be aware of.

          I’ve been looking for a definition of “bottom” but Google throws up a lot of irrelevant stuff.

          But your floating point circuitry has to return somethingwhen it doesn’t know the answer. The programming language it’s talking to may not support exceptions.

          • It’s not clear to me either what Joker_vD means. I am familiar with the concept of a bottom type: a bottom type is a return type which is even more void than void. A void-returning method never returns a value but returns control to its caller; a bottom-returning method doesn’t even do that. It either loops forever, or its last act is to throw or call another bottom-returning method. We considered adding a bottom type called never to ECMAScript 4 back in the day. I’m not sure what is meant by a bottom value though.

          • I think the meaning of bottom being used follows the usage in the blog post, which I am now reading.

          • It seems to me that the blog author is using bottom as a type but then isn’t completely clear when he is talking about types and when he is talking about values. I don’t think that obscures his point at the end of the day, but I do disagree with the point.

            While I agree that it would be great if NaN == NaN resulted in something that was not true or false, the designers of IEEE-754 would have won no friends if they started mucking around with the boolean type. Equally, the standard has to still be useful for people with a high concern for speed, people with no exceptions in their language, and people who don’t want to add another flag to their CPU to signal NaN conditions.

            Where I disagree with the author is that since NaN is conceptually not a value – that is the point of the thing – the fact that it breaks the reflexivity property of == isn’t such a drag. Floating-point == is about comparing values and NaN isn’t one. It’s a thing you just have to bear in mind, just as if you made NaN an exception you’d have to bear that in mind too.

          • “Bottom” is a usual name for the least element in a partially ordered set. Such sets are mathematical foundation of the denotational semantics theory, and a good short introduction to it is here: http://en.wikibooks.org/wiki/Haskell/Denotational_semantics

            And the key difference between “making NaN exception” and “making `NaN == x` return false for every x” is that in the first case, NaN *is* bottom, and “NaN == x” returns “_|_” (which is an exception, or a divergence, or something else which certainly disrupts the normal control flow).

          • Thanks for the clarification. I note that since types are a partially ordered set, the bottom type is the bottom element of a hierarchy of types. That is, the partial ordering relation is “is a subtype of”, and the bottom type is a subtype of every type, including void.

          • Apparently I can’t reply directly to the correct comment because of a nesting limit or something, but anyway to make sure everyone is appropriately confused about everything, there are actually two notions of a “bottom” type floating around:

            1) You can treat types as a category with total functions as arrows, in which case the initial object is the type with no values. This sometimes gets called a bottom type because in a poset seen as a category, the initial object is the bottom element.

            2) You can treat types as a poset in terms of subtyping, in which case the bottom element is a subtype of all other types, which will generally still be a type with no values.

            Since casting to a supertype is a total function, if all types are part of the subtyping relation the second is a special case of the first. In C# this probably fails because of fiddly little corner cases but I’m not 100% sure.

            Of course, for this to be useful with non-total functions you have to include an extra, non-value “value” of that type, such as null (for reference types) or NaN for floats. You also have to account for non-local control flow like exceptions. So even if you had a truly uninhabited type in C# (which would have to be a value type with no constructors, among other problems) you could still have functions that “return” that type by throwing an exception. You can then create a poset for each type in terms of being more or less “defined”, i.e. where they contain non-values like the above. (This is more relevant in Haskell where you can have un-forced error thunks floating around, but C# property getters that throw exceptions can bring the same sort of boundless joy) A bottom value is then the least defined value of its type.

            And of course an ostensibly empty type inhabited by a non-value is not empty at all, so you couldn’t define a bottom type (in the first sense above) in C# because any such type would contain bottom values.

            Further confusing matters, the “void” type in C# corresponds more closely to the unit type in ML-style languages, which is a type containing exactly one uninteresting value and is the terminal object in a category of types and total functions. Reference equality comparison is far too interesting, which means a unit type in C# would have to be a value type with only one value, such as a struct with no members.

            So, an empty struct is a top type matching the first sense of bottom type, while Object in C# is a top type for the second sense, and depending on what language you’re talking about a type called “Void” is either a top type, a bottom type, or a top type pretending to be a bottom type but failing because it’s inhabited by a bottom value.

            I could probably confuse things further with more explanations but at this point I’d really be scraping the bottom of the barrel. So to speak.

          • I think one problem with the notion of a “bottom” value is that if a test upon a true “bottom” value is used to conditionally execute or skip arbitrary code, there will generally be no semantically-correct way for the program to proceed. If code says `{ if (x==y) foo(); else bar(); boz();}`, then when `boz` executes the program will know whether `foo` executed, and whether `bar` executed. It’s possible that for some values or combinations of `x` and `y` the language could specify that both `foo` and `bar` should run, or that neither should run, but unless the language were to do something like take two snapshots of system state, run `foo` with one and `bar` with the other, and then mark as “bottom” any variables whose values differed in the two snapshots,. there would be no way to avoid leaking information about the bottom variable.

  6. Pingback: The Morning Brew - Chris Alcock » The Morning Brew #1458

  7. Thank you. Why on earth did they design a new language from the ground up with such a blurred definition of equality . Teaching students how to program becomes extremely difficult with cases such as this.

    • Languages for .NET would be greatly improved if Microsoft defined (and languages enforced) an attribute for function (method, operator, etc.) parameters which would restrict the implicit upcasts, boxing conversions, or other value conversions that could be considered for that parameter when performing overload resolution. Depending upon the exact attribute selected, a parameter should either require the exact type, a reference type, or else a type which could not satisfy any other overload with the same number of parameters, regardless of the types of those parameters. Much of the murkiness of equality tests stems from the lack of such an attribute. Applying such an attribute to Object.Equals(Object) as well as the operands of most == overloads, for example, would have meant that in cases where comparisons of values cast to one type would return true but comparisons cast to another would return false, mixed-type comparisons would be rejected, thus allowing the set of equality tests which compile to behave as an equivalence relation. Although compatibility with existing code may require a compiler option to ignore that attribute, many of the outlawed comparison forms are very seldom used correctly, and some are almost never used correctly (e.g. calls to `ReferenceEquals which boxes the arguments), so any compiler errors would represent code which should almost certainly at minimum be clarified if not fixed.

      • That’s an excellent answer, really appreciated. It’s the lack of consistency in obtaining equality here which I found the most alarming. I understand what Joshua is saying too but Microsoft did have a chance to really implement a very elegant solution to something as fundamental as equality. New developers in particular really really struggle with knowing what kind of equality to apply where.

    • No language was really designed from the ground up except Intercal. C# even less so than normal, IMHO.

      That could be your next project, Eric: a number library in Intercal. It would be different in that Intercal doesn’t already have one built-in.

  8. >> x OP y is equivalent to x.CompareTo(y) OP 0.

    I’ve been telling people to implement the comparison operators this way for years. Perhaps if I can cite you, I’ll get them to trust me….

  9. I agree that implementing the whole comparison logic in a single CompareTo method and then having boilerplate code for the rest is preferable (too bad that we need the boilerplate code – one situation where mixins are very useful!), but was there a specific reason that you picked a static helper method instead of just using the already necessary member CompareTo?

    Just personal taste or some deeper reasons? (Probably the first, but can’t harm to ask I hope)

  10. Pingback: Zebrane ciekawe linki z 7-14 października | Wiadomości o technologiach IT

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s