Comment commentary

A recent highly-voted-up question on Programmers asked what’s wrong with comments that explain complex code? I think quite a bit about how I comment my code but rather than posting an answer on Programmers I thought I’d blog about it a bit here. I already discussed this topic — HOLY GOODNESS TEN YEARS AGO wow I have been writing this blog for a long time —  and everything I said then still applies. But today thanks to Roslyn being open-sourced there is now a large corpus of my code on the internet so I can talk about my comment strategy in the context of real production code.

The code which does semantic analysis of user-defined explicit conversions in C# is a pretty good example.

To start with, notice that the code is in a partial class and every method is private. Because these are private implementation details of a larger class, not public API methods for consumption by Roslyn’s users, there are no XML doc comments anywhere.

Next, notice that every method has an absolutely immense name, like ComputeApplicableUserDefinedExplicitConversionSet. These names clearly explain what the method does using the jargon of the specification; if you want to know what “applicable” means, read the specification. By making the names of methods so verbose I can do things like search the entire source code text for “Applicable” and know that I’ll find all the methods that compute applicability. Codebases where the developers use abbrevs lk n emo 13yo drive me nuts; naming this method CompAppUDEC does no one any favours. Long descriptive names decrease the need for comments because the code itself states its purpose.
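
Purely as an illustration of the difference (these declarations are invented, not the real Roslyn signatures), compare the two styles:

    // Invented declarations for illustration; not the real Roslyn code.
    internal sealed class NamingSketch
    {
        // Searchable and self-describing: a text search for "Applicable" lands
        // here, and the name uses the specification's own jargon.
        private void ComputeApplicableUserDefinedExplicitConversionSetSketch() { }

        // Opaque: tells a maintainer nothing and defeats text search.
        private void CompAppUDEC() { }
    }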

Let’s look at some statistics. This file has 108 lines that are either blank or contain a single brace, 165 lines of comments, 15 lines of precondition assertions, and 122 lines of meaningful code. The comments-to-real-code ratio seems huge; more than one line of commentary per line of code! But if you take a closer look at those comments, you see that one line is the copyright, only three lines are single-line comments explaining the meaning of the next line of code, 39 lines are direct quotes from the specification, and the remaining 122 lines of comments are divided up into seven blocks, several of which are quite long — more than 40 lines in one case.

What’s really going on here should now be clear. The code is structured so that every method implements a particular paragraph of the specification. Each line of code that is motivated by a given line of the specification has that line quoted. This way I can go back and forth between the text of the specification and the text of the code easily. When the spec changes I can do a search on the changed line and then update the code and quoted spec line. And when I find a bug in the implementation, I can easily find the corresponding section in the specification.
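
A rough sketch of that shape (this is not actual Roslyn code; the quoted spec wording and all the names below are illustrative placeholders):

    // Illustrative sketch only; the quoted wording and the names here are
    // placeholders, not copied from the real specification or from Roslyn.
    internal static class SpecQuotingSketch
    {
        private static bool IsEncompassedBy(string sourceType, string targetType)
        {
            // SPEC: If a standard implicit conversion exists from a type A to a
            // SPEC: type B, then A is said to be encompassed by B.
            //
            // The code that implements a spec sentence sits directly under the
            // quoted sentence, so a text search for either one finds both.
            return StandardImplicitConversionExists(sourceType, targetType);
        }

        private static bool StandardImplicitConversionExists(string sourceType, string targetType)
        {
            // Placeholder for the sketch; the real analysis is far more involved.
            return sourceType == targetType;
        }
    }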

The really big comments are places where the code deliberately diverges from the specification in order to maintain a bug from the previous version because that is precisely what needs a huge amount of explanation! Those big comments are there to ensure that no one maintaining the code in the future says “oh, this is a spec violation and therefore a bug, I will fix it”, and to ensure that the consequences of every violation of the specification can be clearly understood.
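
To make that concrete, a deliberate divergence typically reads something like the following sketch; the scenario and comment text are invented, though the marker phrase itself is the real one used in the codebase:

    // Invented example; the scenario and the explanatory text are illustrative,
    // not copied from the Roslyn sources.
    internal static class SpecViolationSketch
    {
        private static bool ClassifyConversionSketch(string sourceType, string targetType)
        {
            // DELIBERATE SPEC VIOLATION: The specification says this case must be
            // rejected, but the previous compiler accepted it and shipping code
            // depends on that behaviour. We reproduce the old behaviour here; do
            // not "fix" this without understanding the compatibility consequences.
            return sourceType == targetType;
        }
    }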

And of course by reading these comments and this code you probably now know far more about explicit conversion processing than any sane human being ought to know. This is one of the most confusing parts of the specification, so it is no wonder that the previous implementation was buggy, and that Roslyn has to be bug-for-bug compatible. People often ask why it is that Roslyn took so long; remember, I and my colleagues had to discover all those bugs, deduce what the previous compiler’s behavior was, and then copy that behavior into Roslyn, and then write huge numbers of regression tests to ensure that the bug stayed enshrined forever. Not to mention the arguments about whether the spec ought to be updated to match the compiler. It’s a lot of work rewriting a ten-year-old compiler in a new language!

60 thoughts on “Comment commentary”

    • Well, first, you don’t know the future. Rewrites are expensive and so they usually only happen when it becomes very clear that maintaining the existing codebase is not going to be cost effective in meeting future needs. That was not the case in the C# 1.0 timeframe. Second, more specifically “after version 1” would have been a bad time to try to write a C# compiler in C#. Trying to add generic types to the CLR and C# language at the same time as rewriting the compiler sounds like a recipe for disaster.

      • C# is an interesting case since it seems that bootstrapping the compiler is one of the first things language creators do. C# is probably the only mainstream compiled language that took so long to bootstrap.

        • It’s a complicated decision. Among the many factors considered were things like: significant portions of the C# compiler logic involved in emitting metadata were written for the compiler team by the CLR team in a library that did not afford easy usage from managed code; of course, that could have been fixed with some effort, but it was points against a rewrite; effort is costly. More generally, the C# team was frequently the “longest pole” in Visual Studio — that is, the team with the most work on the schedule in a given release — and that meant considerable pressure against anything that would risk slipping the schedule for no immediately obvious user benefit.

  1. On a more serious note, I really like the idea of directly quoting the spec in the code so that it’s easy to switch back and forth. That seems like the best possible use of a comment: describing how the problem domain relates to the code.

    I really appreciate the time you spend describing the thought process behind the code (in this case the naming and comments).

    • I too like the easy correlation, but have found that “copy/paste” of the specification into the code is highly problematic in the face of changing specifications. When that is a consideration “linking” the two artifacts in a way that is easy to search and bi-directionally follow may be a better alternative.

  2. I started to use long method names in C# after I did two months of iOS development in ObjC. Since then I have wanted to see the Smalltalk-like method call syntax in C# too. Of course the code gets much longer, but if the code is well written it’s almost always clear what happens without any docs like reference documentation or comments! Actually, the always-named parameters and the square-bracket method call syntax are my favourite features in ObjC.

    Most of my fellow students don’t see that benefit and try to stick with names as short as possible. The teacher does the same. Imagine sample code with a very short method name and a comment right after the method head…

    • While I hate abbreviations and lack of clarity, I’ve dealt with code that has gone to the other extreme. I try to keep every line to 120 chars max, but some method names I’ve seen make that hard.

      So, I go with, “make the method name as short as possible, but no shorter.”

  3. Eric, do you know of any system which could make the connection between code, tests and spec tighter? For example, “linking” together a test (or an assertion), a code block, and a paragraph in the spec somehow. So that, for example, when a test or an assertion fails, one could trace it to the violated spec paragraph and get more context about the issue. IDEs could also maybe make use of this to provide better contextual help.

      • “Only two percent of the world’s population is born to be super programmers. And only two percent of the population is born to be super writers. And Knuth is expecting everybody to be both.” (Jon Bentley)

        It is easy/convenient for a super writer/programmer to advocate Literate Programming. 🙂

        • Assuming that these numbers are correct, and that the two things are independent, then given the world’s current population, there should be almost 3 million people alive now who would like literate programming. Somehow this sounds a bit high to me, but I could be wrong.

    • A nice approach that I have mentioned before is the way Waldemar Horwat managed the ECMAScript 4 specification — though of course the process to produce the specification failed, the specification itself was really interesting.

      Waldemar first defined a simple meta-language, then wrote the specification in the meta-language, and then wrote an interpreter in Lisp for the meta-language. The result was that he could make an edit to the specification and then actually run the specification!

    • Lots and lots of previously-undiscovered bugs were discovered directly as a result of carefully writing the new compiler to be an exact-as-possible implementation of the specification.

      Some of them turned out to be simple omissions or minor errors in the specification which were dealt with by changing the specification. For example, we found some places where the spec said to consider the type of an expression, but not every expression has a type. By carefully implementing the spec we could then detect these situations and fix the spec.

      For some of them we took the breaking change and fixed the bug. There was a high bar for these; if the fix changed the behavior of code that we knew to exist “in the wild” then that was big points against fixing it.

      For many of them we enshrined the buggy behavior into the compiler and did not change the specification. My intention was to mark all of them with DELIBERATE SPEC VIOLATION in the semantic analyzer, so search the Roslyn codebase for that phrase if you want to see them. I’m sure I missed a few, but there are plenty to choose from!
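
      An aside on the “not every expression has a type” point above: two familiar examples of typeless expressions (my illustrations, not the specific cases we hit) are the null literal and lambda expressions, which only acquire a type through conversion to their target type:

          // Sketch: these expressions have no type of their own; each gets one
          // only from the conversion to its target type.
          using System;

          internal static class TypelessExpressionSketch
          {
              private static void Demo()
              {
                  string s = null;           // the null literal has no type
                  Func<int, int> f = x => x; // a lambda has no type until converted
              }
          }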

  4. I really like the // SPEC: tag usage and function naming. Currently, in our internal coding standards, we have agreed that a variable should have at least 3 letters with the exception of i,j,k for iterator counters and a few legacy ones known very well in our codebase (ql for querylogic, for example).

    Your code has inspired me to update our internal style guide with the option that if a variable is mentioned in the specifications, the variable in code may be named the same, if and only if it is preceded by a // SPEC: comment to link the variable to the specification.
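
    A sketch of what that rule might look like in practice (the quoted spec wording and the identifiers here are invented):

        // Invented example; the quoted wording and names are placeholders, not
        // taken from the real specification.
        using System.Collections.Generic;

        internal static class SpecNamedVariableSketch
        {
            private static string PickSketch(IReadOnlyList<string> candidateTypes)
            {
                // SPEC: "...otherwise, T is the most encompassing type in the set..."
                // The single-letter name T would normally violate the three-letter
                // rule, but the // SPEC: line above ties it to the spec's own name.
                string T = candidateTypes.Count > 0 ? candidateTypes[0] : null;
                return T;
            }
        }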

  5. Pingback: The Morning Brew - Chris Alcock » The Morning Brew #1690

  6. Looking at the code you linked, I’ve got a question unrelated to the article.

    Most of the time, you seem to explicitly write the type name in the code:

    HashSet<DiagnosticInfo> _useSiteDiagnostics = useSiteDiagnostics;
    Conversion fromConversion = EncompassingExplicitConversion(sourceExpression, source, convertsFrom, ref useSiteDiagnostics);

    However, a couple of times you use the var keyword instead:

    var result = ClassifyStandardConversion(expr, a, b, ref useSiteDiagnostics);
    var d = ArrayBuilder.GetInstance();

    Is there a rationale behind your usage of the var keyword?

    • Yes, there are two conflicting rationales! 🙂

      The compiler infers the type of almost everything already. If x, y and z are ints then I think most developers would agree that adding casts to “(x + y) * z” to indicate that x, y, z, the result of the addition and the result of the multiplication are all integers would not make the code any clearer. Rather, adding those type annotations would decrease the readability of the code. I feel the same way about types on local variables. The meaning of the local variable should be independent of its type and clear without it.

      However some of my coworkers on the Roslyn team had the opposite point of view: that when reading unfamiliar code, one of their main clues for understanding it is to have the types of local variables manifest in the code. Though I take their point, it seems odd to me that it is goodness to say that “foo” is a MyFoo in “MyFoo foo = GetFoo();” but badness to say that “bar” in “foo.bar.Blah()” is a MyBar. But some people felt strongly about this and so I tried to break my habit of using “var” and instead use manifest types.

      My code is therefore quite inconsistent in this matter, and I never did go back and clean it up to follow a consistent rule.

      • The principle I would favor would be that one should use `var` [and even be allowed to use `var` for public members] in cases where the expression is one of the following:

        1. A `new` expression of the desired variable type

        2. A cast to the desired variable type (sometimes a cast would be needed either for reasons of syntax or legibility even if the variable type were explicitly specified, and specifying the type twice violates DRY).

        3. A method invoked using static-member syntax, whose return type is the same as the class containing it [e.g. a call to Wizzo.Fnorble(), whose return type is Wizzo].

        If `var` declarations were required by a code-validation tool to match one of those patterns, then a simple examination of a `var` declaration in isolation would suffice to identify a single type which the variable would need to be in order to pass validation; a validator would need to examine factory methods to ensure that they returned the correct type, but a person examining code which had been validated wouldn’t have to examine the factory method to know what type it returned.
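
        A small sketch of code matching those three patterns (Wizzo and Fnorble are the invented names from the comment above; the rest is placeholder so the example is self-contained):

          // Sketch only; Wizzo and Fnorble come from the comment above, and the
          // rest is placeholder code.
          class Wizzo
          {
              public static Wizzo Fnorble() { return new Wizzo(); }
          }

          class VarPatternSketch
          {
              static void Demo(object someObject)
              {
                  var a = new Wizzo();        // 1. a `new` expression of the intended type
                  var b = (Wizzo)someObject;  // 2. a cast to the intended type
                  var c = Wizzo.Fnorble();    // 3. a static call whose return type is its declaring type

                  // In each case the declaration alone names the one type the
                  // variable must have, so neither a validator nor a reader needs
                  // any outside context.
              }
          }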

      • I have been writing code for a project for a few years. In that project the use of var is very inconsistent as well; it was my first project with .NET 4, so initially it looked a lot like .NET 2, but as time passed and my coding style evolved I started to use var more, and other features too. Today, with all the generics, LINQ, anonymous types and so on, using var for almost every variable declaration is a no-brainer: the exact type is irrelevant, coding stops if someone has to go look it up, and it may change in the near future, so spelling it out makes maintenance harder.

          • One element not covered… what is the desired behavior if the return type of the RHS changes? If you want the code to “automatically” keep working [provided there is semantic equivalence] then var is an option. If you want the code to break (compilation error) then an explicit type is a requirement.
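
            A tiny illustration (Widget and GetWidgets are invented placeholders):

              // Invented sketch; Widget and GetWidgets are placeholders.
              using System.Collections.Generic;

              class Widget { }

              class RhsChangeSketch
              {
                  static List<Widget> GetWidgets() { return new List<Widget>(); }

                  static void Demo()
                  {
                      // Suppose a later release changes GetWidgets to return
                      // IReadOnlyList<Widget> instead of List<Widget>:
                      var a = GetWidgets();           // keeps compiling; silently follows the new type
                      List<Widget> b = GetWidgets();  // becomes a compile error, forcing a review
                  }
              }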

  7. Hi Eric,

    I think in this case the use of comments is really good – does your spec change much? Do you find yourself having to update comments even if the spec changes and there isn’t a noticeable bug?

  8. Pingback: Commenting Code | Coding Afro

  9. Pingback: Dew Drop – September 9, 2014 (#1851) | Morning Dew

  10. I don’t think it is good to put too many comments in code, because it is also time consuming to read a bunch of comments instead of the code itself. We need to balance the code and its comments. The objective is to make things simple and clear.

  11. One point that sort of slides by here is the notion of using very long, full-word, descriptive identifiers as a documentary device. The fact that IntelliSense actually /works/ for C# development is the most significant factor in making this palatable. You type an identifier once, and then choose it from a list thereafter.

  12. I have no idea what the coder is talking about in the long comments. What it looks like to me is that the tail is wagging the dog. Time to take a couple of steps back, chuck some bad code or bad libraries or bad something along with this beautifully documented mess and start over.

    If you have to write more comments than code, then something is wrong.

    • I think you have failed to grasp the entire point of this article. The long comments are there *precisely* because *something is wrong*. Namely, *there’s a bug in the original compiler that must be preserved to avoid breaking customers*. So yes, absolutely, if you have to write more comments than code, something is wrong. The comment is there to describe the wrongness precisely so that future developers understand that wrongness deeply.

      You suggest chucking something. If we were in a position where we could chuck out the bad behaviour, we would have, but we were not; our extensive research showed that there were real customers whose real programs would be really, really broken by chucking out the bad implementation of the previous compiler and replacing it with something simple, sensible and spec compliant. So your suggestion basically comes down to “prioritize making the code simple enough to not need comments over the needs of the customers”. This is not an attitude that delights customers. Remember, the code is a means to an end, not an end in itself.

      Now, if you don’t understand the comments, that is probably because you are not an expert who has spent months studying the C# specification. I wrote those comments for such an audience.

      • I know that some people are adamantly opposed to the idea of C# “dialects”, but it seems to me that a willingness to accept the idea of dialects (which could be requested by special markers in the code) could free languages from a great deal of historical baggage. If the spec said “when the XX option is enabled, these other rules apply”, then the code in the compiler to handle the quirks wouldn’t have to say “the spec is defective here”, but would simply link back to the spec. Further, the fact that a piece of code in the compiler existed to implement the old behavior would be made more clear *in the code itself* if e.g. the check for improper assignment of a “computed” zero to an enumerated type were preceded by “if (Options.StrictEnumZeroes &&…)”. Finally, even if it’s necessary for the compiler to support old code that was written under defective rules, that doesn’t mean that new code shouldn’t be developed using improved rules.

      • “Remember, the code is a means to an end, not an end in itself.”

        You need to take another step back, you are still not seeing the big picture. The end itself is a system that works for this customer. I just found out that other customers “require” the broken compiler, but their end is also a system, not broken code that requires a broken compiler. It sounds like you have a lot more to fix.

        The correct means to all these ends is a correct compiler and correct code.

          • I just read your bio and apologize. I am definitely not the right person to learn about corner cases by reading 40-line comments about any computer language. A computer language is nearly always incorrect, as evidenced by version X+1. The customer problems always trend messy and often involve legacy systems.

            But I would respectfully still disagree with the concept of writing about compiler matters as code comments. The structure of code is for organizing declarations and procedures, but the structure of a document is more flexible. Perhaps a separate document could be used to better organize and explain these cases and the rationale behind them. Hypertext might be even better.

        • I write a lot of code targeting embedded micro-controllers, where compilers seem to be particularly buggy or deviate from the language spec. My solution to deal with this is simply to write code in a language that is the intersection of what the specification says and what the compiler implements. If you write code that conforms to the language spec but not the compiler then it will be broken, and if you write code that conforms to the compiler but not the spec then your code isn’t portable across different compilers (or versions of the same compiler). Best is to write in the subset of the language that belongs to both. This would be what you might call “correct code”, since it conforms to the spec. I think most of the time it’s easy enough to write correct code if you write simple code – code that’s easy/unambiguous to read and doesn’t use any behavior that’s not immediately obvious.

          If everyone wrote “correct code” then it would be easy to fix the compiler to be correct as well without breaking people’s code, but I don’t think that’s the case. My experience is that most people just write whatever code appears to work. And even though I do try to write correct code myself, I can’t verify that it conforms to all versions of all compilers (including future versions). It’s in these cases that I want to feel like it’s still safe to upgrade to a newer compiler without any unexpected changes in behavior.

          However, if I were involved in the development of C# I would probably put some strategy in place to consider fixing the compiler in the long term, so that version 50 doesn’t still have baggage from version 1. Perhaps by having the new compiler emit warnings where the code actually activates one of the deviations-from-the-spec, and having a business policy that we don’t need to be backwards compatible with versions more than 10 years old (because those people have then had 10 years of warnings telling them to fix their code before the compiler breaks it). Perhaps there is some other strategy in place already which I’m not aware of.

          • Agree 100%. IMHO, language evolution should, if practical, follow a path in which code for a reasonably-current version of the language is likely to either work identically in another version or else refuse compilation outright, but be readily adaptable to yield the old behavior while compiling cleanly. For example, I would like to see C compilers migrate toward allowing octal numbers of the form 0123 only when used in compatibility mode, but defining another syntax which could be used instead (since octal numbers are occasionally useful). Code which needs to compile with either old or new compilers could write octal numbers as OCT(123) and then define OCT(x) as either __OCT_PREFIX##x when that is defined, or as 0##x when it isn’t.

            If a new prefix were included now, along with a predefined macro __OCT_PREFIX, then it would be possible to deprecate the use of the 0123 format and eventually disallow it except in “compatibility” mode. Likewise, if a standard means were added to allow code to designate an eagerly-matched character sequence as a synonym for backslash, the abominations known as trigraphs could be deprecated and eliminated entirely [while not every character set will contain a backslash, I think every practical character set which lacks the backslash contains at least one character outside the C character set which could be used in its place].

  13. When the source code was released I downloaded it and read a few random classes. I didn’t like the coding style I saw; I found it confusing and hard to follow, with too much mutability and non-obvious behavior. After looking at the code this article links to, I feel like I am reading the source of another project: none of the code I read before looks like this one. There are similarities in general method organization (or responsibility? I am not sure I am being clear here), but the code itself is very different. I liked how the comments were placed, and the way methods were defined and placed in the file was much easier to follow. What really surprised me is how much coding style differed within a single team at MS.

  14. Allen Holub, whom I greatly respect, in his famous book ‘Enough Rope to Shoot Yourself in The Foot’, which was my programming Scripture while I was still a C++ speaker, touches on the subject of comments and self-documenting code. I highly recommend it.
    As a rule of thumb for commenting, I personally would use a code review session. When you go through your code with a fellow developer and suddenly feel the need to stop and explain what a particular fragment does–that’s where you need to put your explanation into a comment. And don’t write a novel there: just a couple of lines, please. 🙂
    And as for the closeness between the spec and the code: just use WF. A runnable diagram: how cool is that?

  15. By reading this article and the piece of code from Roslyn I learned a lot. Thank you for taking the time to write this; you have improved the skills of at least one programmer. I hope this will have an effect on my colleagues as well.

  16. If we compare code to literature, something that is well written needs no further explanation; i.e. you wouldn’t expect to see comments in a novel clarifying the author’s intent. Code is, however, logical and not just purely descriptive, but that aspect can be handled by unit tests. The name of the test and the logic therein provide this extra level of documentation. I see comments as a last resort, where good naming and good testing have both failed to supply enough information to understand the code.

    • Explanatory footnotes are rare in narrative fiction, since they tend to “take the reader out of the story”. If a reader can guess an unfamiliar term’s meaning reasonably well, the imperfections in that guess may detract from the reader’s enjoyment less than would a diversion to read a clarifying footnote. Non-fiction writing is a different story, however. Because different readers will have different levels of familiarity with the material, footnotes can often be a useful way of providing explanatory text for readers that need it, without overly distracting those that don’t.

      While some people really dislike the idea of marking code with “telescoping sections”, I personally think the technique is underutilized. It’s obviously very important that code reviews examine all telescoping sections to ensure consistency, but effective tools for telescoping can allow code which is syntactically bulky but conceptually simple to be placed in the context where it belongs, rather than being pulled out into a separate method purely for visuals’ sake. Moving a section of code from the middle of one method into a method of its own allows it to be invoked from arbitrary contexts. If the piece of code would make sense in multiple contexts, that’s great. If, however, an accurate description of the precise contexts where a piece of code would be appropriate would be longer than the code itself, and the only plausible context which would meet that description is the one where the code is actually used, then moving the code to a separate method would either require excessive verbosity (describing all the conditions that must be met by the context where it’s used) or reduce semantic precision (falsely implying that the code might be useful in contexts where it would be unsuitable). Being able to simply collapse the code into a brief description of what it does *in the context where it appears* would be more helpful.
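
      In C# the nearest built-in approximation is #region, which editors collapse to the single descriptive line; a small invented example:

        // Invented example of a "telescoping section" using C#'s #region; the
        // bulky-but-simple validation stays in context and collapses to one line.
        using System;

        internal static class RegionSketch
        {
            internal static int ParsePort(string configuredValue)
            {
                #region Validate and normalize the configured port value
                if (string.IsNullOrWhiteSpace(configuredValue))
                {
                    return 8080; // assumed default, for the sketch only
                }
                if (!int.TryParse(configuredValue, out int port) || port < 1 || port > 65535)
                {
                    throw new ArgumentOutOfRangeException(nameof(configuredValue));
                }
                #endregion

                return port;
            }
        }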

    • That remark is imprecise, true. A more nuanced way to express it would be to say that whether Roslyn is bug-for-bug compatible with previous compilers is considered on a per-bug basis, and many factors go into the decision of whether or not to take a breaking change. The default position that we started from was to implement the spec, and when encountering a discrepancy between the previous compiler and the spec, to default towards maintaining the existing behavior. If that default position turned out to be too costly, or if the bug was deemed to be unlikely to break real code, then we would consider fixing it. If we decided to maintain the bug then we tried to make it clear in the code that the deviation from the specification was deliberate.

      • Have you followed or been involved with any debates about introducing “dialects”? If by design or by accident old compilers implemented some behavior which was inconsistent with where people would like the language to go, but it’s possible that some code somewhere might rely upon it, I would think having options to enable certain quirky behaviors would help reduce the need to keep such behaviors as part of the mainstream language. You’ve blogged in the past about some times the C# team has been courageous enough to fix major defects in the language, such as the behavior of closed-over loop variables or the += and -= operators on auto-events. Would such decisions have been easier or harder if the policy were to support old behaviors with compiler options?

        • I think Roslyn should be well-known-bugs-free and each ad hoc bug-not-fix should be enabled by a specific compiler argument, e.g. /bug:1023,2390.
          As a result the majority would have a bug-free compiler (which is what they really want and need), but the minority could opt in to any bogus behavior they wish.
          But instead everybody was left without an option to opt out.

          • “Make it an option” is an attractive choice but in my opinion a bad one. First, it *massively* increases the testing matrix, and testing is not cheap. Second, the number of users who would actually opt into any one of those behaviours is tiny. It’s an education problem: the vast majority of users have no incentive to care, and the ones who do care do not know that an option exists that could help them.

          • I expect a cheaper and easier to use behaviour would be, instead of switches for each bug, an overall switch that puts the version N compiler into N-1 mode. There’s something like this already in project settings, but I seem to recall from a previous post of yours, Eric, that it doesn’t actually do this exactly.

            This means that the backwards-compatibility mode could be tested as one unit of work, using the old testing assets, and documented as one feature that would be a lot easier for users to get a handle on.

          • @Joshua who I cannot reply to for some reason…

            If you want N-1 behaviour just compile using the previous version of the compiler. It’s not like they are going back in time to erase previous versions.

          • Unfortunately some versions of the compiler are “in place” upgrades; the C# 5.0 compiler installer overwrites the C# 4.0 compiler, but not the C# 3.0 compiler. I found this decision vexing.

          • Ben: We do use that technique. I have three versions of VS on my dev machine. However, this approach has its hassles. I just recently had to explain to a BA that we had to drop support for our old stuff because you can’t run VS 6 and VS 2013 on the same machine in a supported fashion, with any Windows OS.

            I left out the “Why? In the name of all that is holy, why?” stuff that was going on in my head.

  17. Pingback: Top 10 links of the Week #30 | 09/12/2014 | The SoftFluent Blog

  18. Pingback: A different copy-paste error | Fabulous adventures in coding
