When is a cast not a cast?

I’m asked a lot of questions about conversion logic in C#, which is not that surprising. Conversions are common, and the rules are pretty complicated. Here’s some code I was asked about recently; I’ve stripped it down to its essence for clarity:

class C<T> {} 
class D 
{
  public static C<U> M<U>(C<bool> c)   
  { return =something=; } 
} 
public static class X 
{ 
  public static V Cast<V>(object obj) 
  { return (V)obj; } 
}

where there are three possible texts for “=something=“:

  1. (C<U>)c
  2. X.Cast<C<U>>(c);
  3. (C<U>)(object)c

Version 1 fails at compile time. Versions 2 and 3 succeed at compile time, and then fail at runtime if U is not bool.

Question: Why does the first version fail at compile time?

Because the compiler knows that the only way this conversion could possibly succeed is if U is bool, but U can be anything! The compiler assumes that most of the time U is not going to be constructed with bool, and therefore this code is almost certainly an error, and the compiler is bringing that fact to your attention.

Question: Then why does the second version succeed at compile time?

Because the compiler has no idea that a method named X.Cast<V> is going to perform a cast to V! All the compiler sees is a call to a method that takes an object, and you’ve given it an object, so the compiler’s work is done. The method is a “black box” from the caller’s perspective; the compiler does not look inside that box to see whether the mechanisms in that box are likely to fail given the input. This “cast” is not really a cast from the compiler’s perspective, it’s a method call.

Question: So what about the third version? Why does it not fail like the first version?

This one is actually the same thing as the second version; all we’ve done is inlined the body of the call to X.Cast<V>, including the intermediate conversion to object! That conversion is relevant.

Question: In both the second and third cases, the conversion succeeds at compile time because there is a conversion to object in the middle?

That’s right. The rule is: if there is a conversion from a type S to object, then there is an explicit conversion from object to S.[1. Of course it is not the case that there is a conversion from every type to object. There is no conversion from any pointer type to object, from the void return type to object, and there are also some special “typed reference” helper types that cannot be converted to object. Maybe I’ll discuss those in another episode of FAIC.]

By making a conversion to object before doing the “offensive” conversion, you are basically telling the compiler “please throw away the compile-time information you have about the type of the thing I am converting”. In the third version we do so explicitly; in the second version we do so sneakily, by making an implicit conversion to object when the argument is converted to the parameter type.

Question: So this explains why compile-time type checking doesn’t seem to work quite right on LINQ expressions?

Yes! You would think that the compiler would disallow nonsense like:

from bool b in new int[] { 123, 345 } select b.ToString()

because obviously there is no conversion from int to bool, so how can range variable b take on the values in the array? Nevertheless, this succeeds because the compiler translates this to

(new int[] { 123, 345 }).Cast<bool>().Select(b=>b.ToString())

and the compiler has no idea that passing a sequence of integers to the extension method Cast<bool> is going to fail at runtime. That method is a black box. You and I know that it is going to perform a cast, and that the cast is going to fail, but the compiler does not know that.

And maybe we do not actually know it either; perhaps we are using some library other than the default LINQ-to-objects query provider that does know how to make conversions between types that the C# language would not normally allow. This is actually an extensibility feature masquerading as a compiler deficiency: it’s not a bug, it’s a feature! [2. My glib statement here conveniently ignores that this method had quite a nasty bug in its initial release, a bug that was mostly my fault. Late in the game before the release one of the developers changed the implementation of the extension method so that it allowed more conversions than were specified, as a convenience to users. I reviewed the change while under the incorrect impression that the implemented behaviour was the specified behaviour. It was not, and the implementation was quite slow in a common code path. We took a breaking change in a service pack as a result. The cost of breaking a few people who might have been relying on the unintended behaviour was considered to be low compared to the cost to everyone of the slow implementation. This was a tough, controversial call but I think we did the right thing in the end. I regret the error.]


Next time on FAIC: Should C# warn on null dereferences known to the compiler?

Foolish consistency is foolish

Once again today’s posting is presented as a dialogue, as is my wont.

Why is var sometimes required on an implicitly-typed local variable and sometimes illegal on an implicitly typed local variable?

That’s a good question but can you make it more precise? Start by listing the situations in which an implicitly-typed local variable either must or must not use var.

Sure. An implicitly-typed local variable must be declared with var in the following statements:

Continue reading

Eric rambles on about C#, again

Rachel Roumeliotis, who amongst other things edits C# books for O’Reilly, recently did an interview with me where I ramble on about async/await, Roslyn, performance analysis as an engineering discipline, and some broad-strokes ideas for future language research areas. If you have sixteen minutes to burn, check it out! The O’Reilly Radar blog post is here, and the video has also been posted to YouTube here.

A couple things to mention here; first, I say in the video that we’ve shipped one preview release of Roslyn; in fact we have shipped two. The video was recorded before we had announced the new release. And second, I want to re-emphasize that the end bit where you get more of Eric’s musings about ideas for future language research areas are for your entertainment. We have not announced any product beyond Roslyn, and we are certainly making no promises whatsoever about the feature sets of unannounced, entirely hypothetical products. Enjoy!

Implementation-defined behaviour

As I’ve mentioned several times on this blog before, C# has been carefully designed to eliminate some of the “undefined behaviour” and “implementation-defined behaviour” that you see in languages like C and C++. But I’m getting ahead of myself; I should probably start by making a few definitions.

Traditionally we say that a programming language idiom has undefined behaviour if use of that idiom can have any effect whatsoever; it can work the way you expect it to or it can erase your hard disk or crash your machine. Moreover, the compiler author is under no obligation to warn you about the undefined behaviour. (And in fact, there are some languages in which programs that use “undefined behaviour” idioms are permitted by the language specification to crash the compiler!)

An example of undefined behaviour in C, C++ and C# is writing to a dereferenced pointer that you have no business writing to. Doing so might cause a fault that crashes the process. But the memory could be valid, and it could be an important data structure of the runtime software. It could be code that you are now overwriting with code that does something else. And hence, anything can happen; you’re rewriting the software that is running right now into software that does something else.

By contrast, an idiom that has implementation-defined behaviour is behaviour where the compiler author has several choices about how to implement the feature, and must choose one. As the name implies, implementation-defined behaviour is at least defined. For example, C# permits an implementation to throw an exception or produce a value when an integer division overflows, but the implementation must pick one. It cannot erase your hard disk.

All that said, for the rest of this article I’m not going to make a strong distinction between these two flavours of underspecified behaviour. The question I’m actually interested in addressing today is:

What are some of the factors that lead a language design committee to leave certain language idioms as undefined or implementation-defined behaviours?

The first major factor is: are there two existing implementations of the language in the marketplace that disagree on the behaviour of a particular program? If FooCorp’s compiler compiles M(A(), B()) as “call A, call B, call M”, and BarCorp’s compiler compiles it as “call B, call A, call M”, and neither is the “obviously correct” behaviour then there is strong incentive to the language design committee to say “you’re both right”, and make it implementation defined behaviour. Particularly this is the case if FooCorp and BarCorp both have representatives on the committee.

The next major factor is: does the feature naturally present many different possibilities for implementation, some of which are clearly better than others? For example, in C# the compiler’s analysis of a “query comprehension” expression is specified as “do a syntactic transformation into an equivalent program that does not have query comprehensions, and then analyze that program normally”. There is very little freedom for an implementation to do otherwise. For example, you and I both know that

from c in customers 
from o in orders 
where c.Id == o.CustomerId 
select new {c, o}

and

from c in customers 
join o in orders 
  on c.Id equals o.CustomerId 
select new {c, o}

are semantically the same, and that the latter is likely to be more efficient. But the C# compiler never, ever turns the first query expression syntax into a call to Join; it always turns it into calls to SelectMany and Where. The runtime implementation of those methods is of course fully within its rights to detect that the object returned by SelectMany is being passed to Where, and to build an optimized join if it sees fit, but the C# compiler does not make any assumptions like that. We wanted the query comprehension transformation to be syntactic; the smart optimizations can be in the runtime.

By contrast, the C# specification says that the foreach loop should be treated as the equivalent while loop inside a try block, but allows the implementation some flexibility. A C# compiler is permitted to say, for example “I know how to implement the loop semantics more efficiently over an array” and use the array’s indexing feature rather than converting the array to a sequence as the specification suggests it should. A C# implementation is permitted to skip calling GetEnumerator.

A third factor is: is the feature so complex that a detailed breakdown of its exact behaviour would be difficult or expensive to specify? The C# specification says very little indeed about how anonymous methods, lambda expressions, expression trees, dynamic calls, iterator blocks and async blocks are to be implemented; it merely describes the desired semantics and some restrictions on behaviour, and leaves the rest up to the implementation. Different implementations could reasonably do different codegen here and still get good behaviours.

A fourth factor is: does the feature impose a high burden on the compiler to analyze? For example, in C# if you have:

Func<int, int> f1 = (int x)=>x + 1; 
Func<int, int> f2 = (int x)=>x + 1; 
bool b = object.ReferenceEquals(f1, f2);

Suppose we were to require b to be true. How are you going to determine when two functions are “the same”? Doing an “intensionality” analysis — do the function bodies have the same content? — is hard, and doing an “extensionality” analysis — do the functions have the same results when given the same inputs? — is even harder. A language specification committee should seek to minimize the number of open research problems that an implementation team has to solve! In C# this is therefore left to be implementation-defined; a compiler can choose to make them reference equal or not at its discretion.

A fifth factor is: does the feature impose a high burden on the runtime environment?

For example, in C# dereferencing past the end of an array is well-defined; it produces an array-index-was-out-of-bounds exception. This feature can be implemented with a small — not zero, but small — cost at runtime. Calling an instance or virtual method with a null receiver is defined as producing a null-was-dereferenced exception; again, this can be implemented with a small, but non-zero cost. The benefit of eliminating the undefined behaviour pays for the small runtime cost. But the cost of determining if an arbitrary pointer in unsafe code is safe to dereference would have a large runtime cost, and so we do not do it; we move the burden of making that determination to the developer, who, after all, is the one who turned off the safety system in the first place.

A sixth factor is: does making the behaviour defined preclude some major optimization? For example, C# defines the ordering of side effects when observed from the thread that causes the side effects. But the behaviour of a program that observes side effects of one thread from another thread is implementation-defined except for a few “special” side effects. (Like a volatile write, or entering a lock.) If the C# language required that all threads observe the same side effects in the same order then we would have to restrict modern processors from doing their jobs efficiently; modern processors depend on out-of-order execution and sophisticated caching strategies to obtain their high level of performance.

Those are just a few factors that come to mind; there are of course many, many other factors that language design committees debate before making a feature “implementation defined” or “undefined”.

Persistence, façades and Roslyn’s red-green trees

We decided early in the Roslyn design process that the primary data structure that developers would use when analyzing code via Roslyn is the syntax tree. And thus one of the hardest parts of the early Roslyn design was figuring out how we were going to implement syntax tree nodes, and what information they would proffer up to the user. We would like to have a data structure that has the following characteristics:

  • Immutable.
  • The form of a tree.
  • Cheap access to parent nodes from child nodes.
  • Possible to map from a node in the tree to a character offset in the text.
  • Persistent.

Continue reading

Guid guide, part one

What is a GUID? The acronym stands for “globally unique identifier”; GUIDs are also called UUIDs, which stands for “universally unique identifier”.[1. It is unclear to me why we need two nigh-identical names for the same thing, but there you have it.] A GUID is essentially a 128 bit integer, and when written in its human-readable form, is written in hexadecimal in the pattern {xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx}.

The purpose of a GUID is, as the name implies, to uniquely identify something, so that we can refer to that thing by its identifier and have confidence that everyone can agree upon what thing we are referring to. Think about this problem as it applies to, say, books. It is cumbersome to refer to a book by quoting it in its entirety every time you mention it. Instead, we give every book an identifier in the form of its title. The problem with using a title as an identifier is that there may be many different books with the same title. I have three different books all entitled “The C# Programming Language” on my desk right now; if I want to refer to one of them in particular, I’d typically have to give the edition number. But there is nothing (apart from their good sense) stopping some entirely different publisher from also publishing a book called “The C# Programming Language, fourth edition” that differs from the others.

Continue reading

Null is not false, part two

In Raymond Smullyan‘s delightful books about the Island of Knights and Knaves — where, you’ll recall, knights make only true statements and knaves make only false statements — the knights and knaves are of course clever literary devices to explore problems in deductive logic.

(Aside: Smullyan’s book of combinatory logic puzzles, To Mock a Mockingbird is equally delightful and I recommend it for anyone who wants a playful introduction to the subject.)

Smullyan, to my recollection, never explores what happens when knights and knaves make statements which are disingenuous half-truths, authorial license in pursuit of a larger truth, or other forms of truthiness. A nullable Boolean in C# gives us, if not quite the notion of truthiness, at least the notion that true and false are not the only possible values of a predicate: there is also “null”, whatever that means.

What does that mean? A null Boolean can mean “there is a truth state, but I just don’t know what it is”: for example, if you queried a database on December 1st to ask “were the sales figures for November higher than they were in October?” the answer is either true or false, but the database might not know the answer because not all the figures are in yet. The right answer in that case would be to say “null”, meaning “there is an answer but I do not know what it is.”

Or, a null Boolean can mean “the question has no answer at all, not even true or false”. True or false: the present king of France is bald. The number of currently existing kings of France — zero — is equal to the number of currently existing bald kings of France, but it seems off-putting to say that a statement is “vacuously true” in this manner when we could more sensibly deny the validity of the question. There are certainly analogous situations in computer programming where we want to express the notion that the query is so malformed as to not have a truth value at all, and “null” seems like a sensible value in those cases.

Because null can mean “I don’t know”, almost every “lifted to nullable” operator in C# results in null if any operand is null. The sum of 123 and null is null because of course the answer to the question “what is the sum of 123 and something I don’t know” is “I don’t know!” The notable exceptions to this rule are equality, which says that two null values are equal, and the logical “and” and “or” operators, which have some very interesting behaviour. When you say x & y for nullable Booleans, the rule is not “if either is null then the result is null“. Rather, the rule is “if either is false then the result is false, otherwise, if either is null then the result is null, otherwise, the result is true“.

Similarly for x | y — the rule is “if either is true then the result is true, otherwise if either is null then the result is null, otherwise the result is false“. These rules obey our intuition about what “and” and “or” mean logically provided that “null” means “I don’t know”. That is the truth value of “(something true) or (something I don’t know)” is clearly true regardless of whether the thing you don’t know is true or false. But if “null” means “the question has no answer at all” then the truth value of “(something true) or (something that makes no sense)” probably should be “something that makes no sense”.

Things get weirder though when you start to consider the “short circuiting” operators, && and ||. As you probably know, the && and || operators on Booleans are just like the & and | operators, except that the && operator does not even bother to evaluate the right hand side if the left hand side is false, and the || operator does not evaluate the right hand side if the left hand side is true. After we’ve evaluated the left hand side of either operator, we might have enough information to know the final answer. We can therefore (1) save the expense of computing the other side, and (2) allow the evaluation of the right hand side to depend on a precondition established by the truth or falsity of the left hand side. The most common example of (2) is of course if (s == null || s.Length == 0) because the right hand side would have crashed and burned if evaluated when the left hand side is true.

The && and || operators are not “lifted to nullable” because doing so is problematic. The whole point of the short-circuiting operator is to avoid evaluating the right hand side, but we cannot do so and still match the behaviour of the unlifted version! Suppose we have x && y for two nullable Boolean expressions. Let’s break down all the cases:

  • x is false: We do not evaluate y, and the result is false.
  • x is true: We do evaluate y, and the result is the value of y
  • x is null: Now what do we do? We have two choices:
    • We evaluate y, violating the nice property that y is only evaluated if x is true. The result is false if y is false, null otherwise.
    • We do not evaluate y. The result must be either false or null.
      • If the result is false even though y would have evaluated to null, then we have resulted in false incorrectly.
      • If the result is null even though y would have evaluated to false, then we have resulted in null incorrectly.

In short, either we sometimes evaluate y when we shouldn’t, or we sometimes return a value that does not match the value that x & y would have produced. The way out of this dilemma is to cut the feature entirely.

I said last time that I’d talk about the role of operator true and operator false in C#, but I think I shall leave that to the next episode in this series. Next time on FAIC we’ll digress briefly and then conclude this series after that.

Null is not false, part one

The way you typically represent a “missing” or “invalid” value in C# is to use the “null” value of the type. Every reference type has a “null” value; that is, the reference that does not actually refer to anything. And every “normal” value type has a corresponding “nullable” value type which has a null value.

The way these concepts are implemented is completely different. A reference is typically implemented behind the scenes as a 32 or 64 bit number. As we’ve discussed previously, that number should logically be treated as an “opaque” handle that only the garbage collector knows about, but in practice that number is the offset into the virtual memory space of the process that the referred-to object lives at, inside the managed heap. The number zero is reserved as the representation of null because the operating system reserves the first few pages of virtual memory as invalid, always. There is no chance that by some accident, the zero address is going to be a valid address in the heap.

Continue reading