The best advice I ever got

The super nice people over at InformIT[1. You may recall that they also recently asked me for my advice on good books for C# programmers. In the interests of full disclosure, I note that in my spare time I write and edit C# programming books for Addison-Wesley, which is owned by the same company that owns InformIT.] are running a series of short articles with the theme “the best advice I ever got”, which I think should prove interesting. They were kind enough to ask me to give an example of some good advice I got that helped my programming career, though as you’ll see it is not actually about programming at all.


Next time on FAIC: When is a cast not a cast?


Foolish consistency is foolish

Once again today’s posting is presented as a dialogue, as is my wont.

Why is var sometimes required on an implicitly-typed local variable and sometimes illegal on an implicitly-typed local variable?

That’s a good question but can you make it more precise? Start by listing the situations in which an implicitly-typed local variable either must or must not use var.

Sure. An implicitly-typed local variable must be declared with var in the following statements:

var x1 = whatever; 
for(var x2 = whatever; ; ) {} 
using(var x3 = whatever) {} 
foreach(var x4 in whatever) {}

And an implicitly-typed local variable must not be declared with var in the following expressions:

from c in customers select c.Name
customers.Select(c => c.Name)

In both cases it is not legal to put var before c, though it would be legal to say:

from Customer c in customers select c.Name 
customers.Select((Customer c) => c.Name)

Why is that?

Well, let me delay answering that by criticizing your question further. In the query expression and lambda expression cases, are those in fact implicitly typed locals in the first place?

Hmm, you’re right; technically neither of those cases involves a local variable. In the lambda case, that is a formal parameter. But a formal parameter behaves almost exactly like a local variable, so it seems reasonable to conflate the two in casual conversation. In the query expression case, the compiler is going to syntactically transform the range variable into an untyped lambda formal parameter regardless of whether the range variable is typed or not.

Can you expand on that last point a bit?

Sure. When you say:

from Customer c in customers select c.Name

that is not transformed by the compiler into:

customers.Select((Customer c) => c.Name)

Rather, it is transformed into:

customers.Cast<Customer>().Select(c => c.Name)

Correct. Discussing why that is might be better left for another day.

Indeed; the point here is that regardless of whether a type appears in the query expression, the lambda expression in the transformed code is going to have an untyped formal parameter.

So now that we’ve clarified the situation, what is your question?

C# is inconsistent; var is required on an implicitly-typed local variable (regardless of the “container” of the local declaration), but var is illegal on an implicitly-typed lambda parameter (regardless of whether the lambda parameter is “natural” or is the result of a query expression transformation). Why is that?

You keep on asking “why?” questions, which I find difficult to answer because your question is secretly a “why not?” question. That is, the question you really mean to ask is “I have a notion of how the language obviously ought to have been designed; why is it not like that?”  But since I do not know what your notion is, it’s hard to compare its pros and cons to the actual feature that you find inconsistent.

The problem you are raising is one of inconsistency; I agree that you have identified a bona fide inconsistency, and I agree that a foolish, unnecessary inconsistency is bad design. The C# language is designed to be easy to comprehend; foolish inconsistencies work against comprehensibility. But what I don’t understand is how you think that inconsistency ought to have been addressed.

I can see three ways to address that inconsistency. First, make var required on lambdas and query expressions, so that it is consistently required. Second, make var illegal on all locals, so that it is consistently illegal. And third, make it optional everywhere. Which of those is your real “why not?” question?

You’re right; I’ve identified an inconsistency but have not described how I think that inconsistency could be removed. I don’t know what my real “why not?” is so let’s look at all of them; what are the pros and cons of each?

Let’s start by considering the first: require var everywhere. That would then mean that you have to write:

from var c in customers 
join var o in orders...

instead of

from c in customers 
join o in orders...

And you have to write

customers.Select((var c) => c.Name)

instead of

customers.Select(c => c.Name)

This seems clunky. What function does adding var in these locations serve? It does not make the code any more readable; it makes it less readable. We have purchased consistency at a high cost here. The first option seems untenable.

Now consider the second option: make var illegal on all locals. That means that your earlier uses of var on locals would become:

x1 = whatever; 
for(x2 = whatever; ; ) {} 
using(x3 = whatever) {} 
foreach(x4 in whatever) {}

The last case presents no problem; we know that the variable declared in a foreach loop is always a new local variable. But in the other three cases we have just added a new feature to the language; we now have not just implicitly typed locals, we now have implicitly declared locals. Now all you need to do to declare a new local is to assign to an identifier you haven’t used before.

There are plenty of languages with implicitly declared locals, but it seems like a very “un-C#-ish” feature. That’s the sort of thing we see in languages like VB and VBScript, and even in them you have to make sure that Option Explicit is turned off. The price we pay for consistency here is very different, but it is still very high. I don’t think we want to pay that price.
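To make that price concrete, here is a sketch of the classic hazard, written in a hypothetical C# with implicitly declared locals as in option two; this is deliberately not legal C# as it actually exists:

// Hypothetical C# with option two: assigning to a new identifier
// declares a fresh local. This is NOT legal C# today.
total = 0;
for (i = 0; i < 10; i = i + 1)
    totl = total + i;   // typo! silently declares a new local "totl";
                        // "total" stays 0 and the compiler reports nothing

This is precisely the class of bug that Option Explicit exists to prevent.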

The third option, making var optional everywhere, is just a special case of the second option: if var may be omitted, then x1 = whatever; must be a legal way to declare x1, and again we end up introducing implicitly declared locals.

Design is the art of compromising amongst various incompatible design goals. In this case, consistency is a goal but it loses to more practical concerns. This would be a foolish consistency.

Is the fact that there is no good way out of this inconsistency an indication that there is a deeper design problem here?

Yes. When I gave three options for removing the inconsistency, you’ll notice that I made certain assumptions about what was and was not allowed to change. If we were willing to make larger changes, or if the original design team had made different decisions in C# 1.0, then we wouldn’t be in this mess in the first place. The deeper design problem here is the fact that a local variable declaration has the form

type identifier ;

This is not a great statement syntax in the first place. Suppose C# 1.0 had instead said that a local variable must be declared like this:

var identifier : type ;

I see where you are going; JScript.NET does use that syntax, and makes the type annotation clause optional. And of course Visual Basic uses the moral equivalent of that syntax, with Dim instead of var and As instead of :.

That’s right. So in typing-required C# 1.0 we would have

var x1 : int = whatever; 
for(var x2 : int = whatever; ; ) {} 
using(var x3 : IDisposable = whatever) {} 
foreach(var x4 : int in whatever) {}

This is somewhat more verbose, yes. But this is very easy to parse, both by the compiler and the reader, and is probably more clear to the novice reader. It is crystal clear that a new local variable is being introduced by a statement. And it is also nicely reminiscent of base class declaration, which is a logically similar annotation.[1. You could have just as easily asked "Why does the constraining type come to the left of the identifier in a local, parameter, field, property, event, method and indexer, and to the right of the identifier in a class, struct, interface and type parameter?" Inconsistencies abound!]

Then in C# 3.0 we could introduce implicitly typed locals by simply making the annotation portion optional, and allowing all of the following statements and expressions:

var x1 = whatever; 
for(var x2 = whatever; ; ) {} 
using(var x3 = whatever) {} 
foreach(var x4 in whatever) {}
from c in customers select c.Name 
from c : Customer in customers select c.Name 
customers.Select(c => c.Name) 
customers.Select((c : Customer) => c.Name)

If this syntax has nicer properties then why didn’t you go with it in C# 1.0?

Because the designers wanted the syntax to be familiar to users of C and C++. So here we have one kind of consistency — consistency of experience for C programmers — leading a decade later to a problem of C# self-consistency.

The moral of the story is: good design requires being either impossibly far-sighted, or accepting that you’re going to have to live with some small inconsistencies as a language evolves over decades.


Next time on FAIC: The best advice I ever got.

Eric rambles on about C#, again

Rachel Roumeliotis, who amongst other things edits C# books for O’Reilly, recently did an interview with me where I ramble on about async/await, Roslyn, performance analysis as an engineering discipline, and some broad-strokes ideas for future language research areas. If you have sixteen minutes to burn, check it out! The O’Reilly Radar blog post is here, and the video has also been posted to YouTube here.

A couple of things to mention here. First, I say in the video that we’ve shipped one preview release of Roslyn; in fact we have shipped two. The video was recorded before we had announced the new release. And second, I want to re-emphasize that the end bit, where you get more of Eric’s musings about ideas for future language research areas, is for your entertainment. We have not announced any product beyond Roslyn, and we are certainly making no promises whatsoever about the feature sets of unannounced, entirely hypothetical products. Enjoy!

Implementation-defined behaviour

As I’ve mentioned several times on this blog before, C# has been carefully designed to eliminate some of the “undefined behaviour” and “implementation-defined behaviour” that you see in languages like C and C++. But I’m getting ahead of myself; I should probably start by making a few definitions.

Traditionally we say that a programming language idiom has undefined behaviour if use of that idiom can have any effect whatsoever; it can work the way you expect it to or it can erase your hard disk or crash your machine. Moreover, the compiler author is under no obligation to warn you about the undefined behaviour. (And in fact, there are some languages in which programs that use “undefined behaviour” idioms are permitted by the language specification to crash the compiler!)

An example of undefined behaviour in C, C++ and C# is writing through a pointer that you have no business writing through. Doing so might cause a fault that crashes the process. But the memory pointed to might be perfectly valid: it could be an important data structure of the runtime software, or it could be code that you are now overwriting with code that does something else. And hence anything can happen; you’re rewriting the software that is running right now into software that does something else.
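In C# you can only get into this situation in unsafe code. A minimal sketch (compile with the /unsafe switch; the address below is an arbitrary made-up value, and you should not actually run this):

class UndefinedBehaviourDemo
{
    static unsafe void Main()
    {
        int* p = (int*)0x12345678; // an arbitrary garbage address
        *p = 42;                   // undefined: may fault, may silently
                                   // corrupt data or code, may "work"
    }
}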

By contrast, an idiom has implementation-defined behaviour if the compiler author has several choices about how to implement the feature and must choose one; as the name implies, the behaviour is at least defined. For example, C# permits an implementation either to throw an exception or to produce a value when an integer division overflows, but the implementation must pick one. It cannot erase your hard disk.
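The division case in question is dividing int.MinValue by -1 in an unchecked context; a quick sketch:

using System;

class DivisionOverflow
{
    static void Main()
    {
        int n = int.MinValue;
        int d = -1;
        unchecked
        {
            // Implementation-defined: either an ArithmeticException (or a
            // subclass) is thrown, or the overflow goes unreported and
            // q is int.MinValue. The implementation must pick one.
            int q = n / d;
            Console.WriteLine(q);
        }
    }
}

(Variables are used here rather than constants so that the compiler does not fold the division at compile time.)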

All that said, for the rest of this article I’m not going to make a strong distinction between these two flavours of underspecified behaviour. The question I’m actually interested in addressing today is:

What are some of the factors that lead a language design committee to leave certain language idioms as undefined or implementation-defined behaviours?

The first major factor is: are there two existing implementations of the language in the marketplace that disagree on the behaviour of a particular program? If FooCorp’s compiler compiles M(A(), B()) as “call A, call B, call M”, and BarCorp’s compiler compiles it as “call B, call A, call M”, and neither is the “obviously correct” behaviour, then there is a strong incentive for the language design committee to say “you’re both right” and make it implementation-defined behaviour. This is particularly the case if FooCorp and BarCorp both have representatives on the committee.
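To make the disagreement concrete, here is a sketch of the sort of program the two hypothetical compilers would disagree about. (FooCorp and BarCorp are made up, of course; and note that C# itself settles this particular question by specifying strict left-to-right evaluation of arguments.)

using System;

class EvaluationOrder
{
    static int A() { Console.WriteLine("A"); return 1; }
    static int B() { Console.WriteLine("B"); return 2; }
    static void M(int a, int b) { Console.WriteLine(a + b); }

    static void Main()
    {
        // FooCorp's compiler: prints A, B, 3.
        // BarCorp's compiler: prints B, A, 3.
        // C# specifies left-to-right evaluation, so: A, B, 3.
        M(A(), B());
    }
}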

The next major factor is: does the feature naturally present many different possibilities for implementation, some of which are clearly better than others? For example, in C# the compiler’s analysis of a “query comprehension” expression is specified as “do a syntactic transformation into an equivalent program that does not have query comprehensions, and then analyze that program normally”. There is very little freedom for an implementation to do otherwise. For example, you and I both know that

from c in customers 
from o in orders 
where c.Id == o.CustomerId 
select new {c, o}

and

from c in customers 
join o in orders 
  on c.Id equals o.CustomerId 
select new {c, o}

are semantically the same, and that the latter is likely to be more efficient. But the C# compiler never, ever turns the first query expression syntax into a call to Join; it always turns it into calls to SelectMany and Where. The runtime implementation of those methods is of course fully within its rights to detect that the object returned by SelectMany is being passed to Where, and to build an optimized join if it sees fit, but the C# compiler does not make any assumptions like that. We wanted the query comprehension transformation to be syntactic; the smart optimizations can be in the runtime.
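Concretely, the compiler turns the first query into something very much like this; I am glossing over the exact shape of the “transparent identifier” that carries both range variables along:

customers
    .SelectMany(c => orders, (c, o) => new { c, o })
    .Where(t => t.c.Id == t.o.CustomerId)
    .Select(t => new { t.c, t.o })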

By contrast, the C# specification says that the foreach loop should be treated as the equivalent while loop inside a try block, but allows the implementation some flexibility. A C# compiler is permitted to say, for example “I know how to implement the loop semantics more efficiently over an array” and use the array’s indexing feature rather than converting the array to a sequence as the specification suggests it should. A C# implementation is permitted to skip calling GetEnumerator.
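The expansion the specification describes looks roughly like this, where collection, V and the loop body are placeholders, and where I am eliding several details about how the enumerator type and the conversions are determined:

{
    var e = collection.GetEnumerator();
    try
    {
        while (e.MoveNext())
        {
            V v = (V)e.Current;   // the spec inserts a conversion here
            // ... loop body ...
        }
    }
    finally
    {
        // the spec disposes the enumerator here when its type
        // implements IDisposable
        IDisposable d = e as IDisposable;
        if (d != null) d.Dispose();
    }
}

An implementation is free to recognize that the collection is an array and compile the loop as an ordinary indexing loop instead, never touching GetEnumerator at all.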

A third factor is: is the feature so complex that a detailed breakdown of its exact behaviour would be difficult or expensive to specify? The C# specification says very little indeed about how anonymous methods, lambda expressions, expression trees, dynamic calls, iterator blocks and async blocks are to be implemented; it merely describes the desired semantics and some restrictions on behaviour, and leaves the rest up to the implementation. Different implementations could reasonably do different codegen here and still get good behaviours.

A fourth factor is: does the feature impose a high burden on the compiler to analyze? For example, in C# if you have:

Func<int, int> f1 = (int x) => x + 1; 
Func<int, int> f2 = (int x) => x + 1; 
bool b = object.ReferenceEquals(f1, f2);

Suppose we were to require b to be true. How are you going to determine when two functions are “the same”? Doing an “intensionality” analysis — do the function bodies have the same content? — is hard, and doing an “extensionality” analysis — do the functions have the same results when given the same inputs? — is even harder. A language specification committee should seek to minimize the number of open research problems that an implementation team has to solve! In C# this is therefore left to be implementation-defined; a compiler can choose to make them reference equal or not at its discretion.

A fifth factor is: does the feature impose a high burden on the runtime environment?

For example, in C# indexing past the end of an array is well-defined; it produces an array-index-was-out-of-bounds exception. This feature can be implemented with a small — not zero, but small — cost at runtime. Calling an instance or virtual method with a null receiver is defined as producing a null-was-dereferenced exception; again, this can be implemented with a small, but non-zero cost. The benefit of eliminating the undefined behaviour pays for the small runtime cost. But determining whether an arbitrary pointer in unsafe code is safe to dereference would have a large runtime cost, and so we do not do it; we move the burden of making that determination to the developer, who, after all, is the one who turned off the safety system in the first place.
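Both defined behaviours are easy to observe; a quick sketch:

using System;

class DefinedExceptions
{
    static void Main()
    {
        int[] items = new int[3];
        try
        {
            int x = items[10];    // defined: always throws IndexOutOfRangeException
            Console.WriteLine(x);
        }
        catch (IndexOutOfRangeException) { Console.WriteLine("bad index"); }

        object receiver = null;
        try
        {
            receiver.ToString();  // defined: always throws NullReferenceException
        }
        catch (NullReferenceException) { Console.WriteLine("null receiver"); }
    }
}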

A sixth factor is: does making the behaviour defined preclude some major optimization? For example, C# defines the ordering of side effects when observed from the thread that causes the side effects. But the behaviour of a program that observes side effects of one thread from another thread is implementation-defined except for a few “special” side effects. (Like a volatile write, or entering a lock.) If the C# language required that all threads observe the same side effects in the same order then we would have to restrict modern processors from doing their jobs efficiently; modern processors depend on out-of-order execution and sophisticated caching strategies to obtain their high level of performance.
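The classic demonstration is one thread polling a flag that another thread sets. A sketch (the names and the timing are made up; whether this program terminates without volatile is up to the implementation):

using System;
using System.Threading;

class SideEffectOrdering
{
    static bool stop;              // ordinary field: no cross-thread guarantees
    // static volatile bool stop;  // volatile makes the write a "special" side effect

    static void Main()
    {
        Thread worker = new Thread(() =>
        {
            // The implementation may hoist the read of "stop" out of the
            // loop entirely; the worker is then never obliged to observe
            // the write made by the main thread below.
            while (!stop) { }
        });
        worker.Start();
        Thread.Sleep(100);
        stop = true;
        worker.Join();  // may or may not return, depending on the implementation
    }
}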

Those are just a few factors that come to mind; there are of course many, many other factors that language design committees debate before making a feature “implementation defined” or “undefined”.

Persistence, façades and Roslyn’s red-green trees

We decided early in the Roslyn design process that the primary data structure that developers would use when analyzing code via Roslyn is the syntax tree. And thus one of the hardest parts of the early Roslyn design was figuring out how we were going to implement syntax tree nodes, and what information they would proffer up to the user. We would like to have a data structure that has the following characteristics:

  • Immutable.
  • The form of a tree.
  • Cheap access to parent nodes from child nodes.
  • Possible to map from a node in the tree to a character offset in the text.
  • Persistent.

By persistence I mean the ability to reuse most of the existing nodes in the tree when an edit is made to the text buffer. Since the nodes are immutable, there’s no barrier to reusing them, as I’ve discussed many times on this blog. We need this for performance; we cannot be re-parsing huge wodges of text every time you hit a key. We need to re-lex and re-parse only the portions of the tree that were affected by the edit[1. Determining what those portions of the tree are is quite tricky; I might blog about that at a later date. An edit that, for example, adds async to a method can cause the parse of await(foo); in the method body to change from an invocation to a usage of the await contextual keyword.], because we are potentially re-doing this analysis between every keystroke.

When you try to put all five of those things into one data structure you immediately run into problems:

  • How do you build a tree node in the first place? The parent and the child both refer to each other, and are immutable, so which one gets built first?
  • Supposing you manage to solve that problem: how do you make it persistent? You cannot re-use a child node in a different parent because that would involve telling the child that it has a new parent. But the child is immutable.
  • Supposing you manage to solve that problem: when you insert a new character into the edit buffer, the absolute position of every node that is mapped to a position after that point changes. This makes it very difficult to make a persistent data structure, because any edit can change the spans of most of the nodes!

But on the Roslyn team we routinely do impossible things; in this case we do the impossible by keeping two parse trees. The “green” tree is immutable, persistent, has no parent references, is built “bottom-up”, and every node tracks its width but not its absolute position. When an edit happens we rebuild only the portions of the green tree that were affected by the edit, which is typically about O(log n) of the total parse nodes in the tree.

The “red” tree is an immutable façade that is built around the green tree; it is built “top-down” on demand and thrown away on every edit. It computes parent references by manufacturing them on demand as you descend through the tree from the top. It manufactures absolute positions by computing them from the widths, again, as you descend.
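Here is a minimal sketch of the idea. These types are hypothetical and vastly simpler than the real Roslyn implementation, but they show the division of labour: green nodes know only their kinds, widths and children; red façades manufacture parent references and absolute positions on the way down.

using System.Collections.Generic;

// Green nodes: immutable, persistent, built bottom-up, no parent
// references, widths but no absolute positions.
sealed class GreenNode
{
    public readonly string Kind;
    public readonly int Width;
    public readonly IReadOnlyList<GreenNode> Children;

    public GreenNode(string kind, int width, IReadOnlyList<GreenNode> children)
    {
        Kind = kind;
        Width = width;
        Children = children;
    }
}

// Red nodes: a façade built top-down on demand; the parent reference
// and the absolute position are computed during descent, never stored
// in the green tree.
sealed class RedNode
{
    public readonly GreenNode Green;
    public readonly RedNode Parent;
    public readonly int Position;   // absolute start offset in the text

    public RedNode(GreenNode green, RedNode parent, int position)
    {
        Green = green;
        Parent = parent;
        Position = position;
    }

    public IEnumerable<RedNode> Children()
    {
        int offset = Position;
        foreach (GreenNode child in Green.Children)
        {
            yield return new RedNode(child, this, offset);
            offset += child.Width;
        }
    }
}

Because a RedNode stores nothing that the green tree does not already determine, the red tree can be thrown away wholesale on every edit and lazily rebuilt from the new green tree.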

You, the consumer of the Roslyn API, only ever see the red tree; the green tree is an implementation detail. (And if you use the debugger to peer into the internal state of a parse node you’ll in fact see that there is a reference to another parse node in there of a different type; that’s the green tree node.)

Incidentally, these are called “red/green trees” because those were the whiteboard marker colours we used to draw the data structure in the design meeting. There’s no other meaning to the colours.

The benefit of this strategy is that we get all those great things: immutability, persistence, parent references, and so on. The cost is that this system is complex and can consume a lot of memory if the “red” façades get large. We are at present doing experiments to see if we can reduce some of the costs without losing the benefits.


Next time on FAIC: What precisely is “implementation-defined behaviour”?