Copy-paste defects

Continuing with my series of answers to questions that were asked during my webcast on Tuesday:

The copy-paste checker example you showed was interesting. I’ve heard that NASA disallows copy-pasting in code because it is so error prone; is this true?

For readers who did not attend the talk: my favourite Coverity checker looks for code where you cut some code from one place, pasted it in another, and then made a series of almost but not quite consistent edits. An example taken from real-world code is:

if (!bounds.Contains(p))
{
  PointF newCenterPt = Utility.GetRectangleCenter(bounds);
  if (p.X > bounds.Right || p.Y < bounds.Left)
    newCenterPt.X = p.X;
  if (p.Y > bounds.Bottom || p.Y < bounds.Top)
    newCenterPt.Y = p.Y;
}

If I hadn’t told you that there was such a defect in this code, it might be quite difficult to spot. The “Y” code was written first, and then copy-pasted to the “X” code above it. “Bottom” was updated to “Right” and “Top” was updated to “Left”, and three of the four “Y”s were updated to “X”; the remaining “Y” makes the logic wrong.
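
For reference, the consistent edit would presumably have been:

if (p.X > bounds.Right || p.X < bounds.Left)
  newCenterPt.X = p.X;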

This defect frequently comes up in the context shown, that is, code which is doing something to a rectangle. However, I’ve also seen this defect where instead of “X” and “Y” and so on, it was “offense”, “defense”, “dodge_to_left”, “throw_to_right”, and so on, in the context of an AI for a sportsball game; the effect of the defect was that sometimes a defending player would take the action that was sensible for an offensive player. The worst copy-paste error I’ve seen so far was in financial code that mixed up exchange rates, at one point getting the exchange rate between dollars and euros backwards. That was a potentially extraordinarily costly defect.

Anyway, to return to the actual question, I do not know if NASA has this rule, but it is a pretty good rule! Any current or former NASA programmers reading this, please post a comment.

In lieu of an answer I’ll tell you two related anecdotes.

Anecdote #1: The very first defect I ever discovered was a copy-paste defect. My sixth-grade teacher had some educational software for the Commodore PET computers we had in my elementary school, and the math-drill software was sometimes producing multiple choice questions with all wrong answers. He asked me to see if I could figure out why, knowing that I was interested in learning how software worked. I did not deduce the defect logically; rather, I noticed that there were two lines of code side by side that were identical, except that much like the code above, one of them had every X changed to a Y except the last one. I had no idea what the line of code did, but I changed the last X to a Y, ran the code again, and the bug went away, so I declared victory. Of course ten-year-old Eric was a “cargo cult programmer”, making changes by gut feel without understanding why they were correct. Today I would take a more logically well-founded approach to fixing such a defect, but the technique is still a pretty good one for defect identification, as we’ve seen here.

Anecdote #2: The Jet Propulsion Laboratory, which makes the software and hardware for many NASA probes such as the rover running around on Mars right now, is a Coverity customer. JPL is very hard core about their usage of the static analyzer. For instance, they have fixed every defect found by static analysis regardless of its age; many customers quite reasonably prioritize recently introduced defects over long-existing defects on the assumption that the ones that have been in the product for years are less likely to be deadly, since they’ve not killed anyone yet.

But the bit I find most interesting is how they handle false positives. A “false positive” is a defect report where the analysis is wrong; there really is no defect in the code. Most customers mark false positives as such in the triage tool, which then suppresses reporting of that particular defect at that particular location on subsequent analyses. JPL doesn’t do that. Rather, if static analysis finds a false positive, they change the code so that its correctness is maintained but the false positive is no longer reported. The reasoning is: if the code is so unclear that it fools the static analyzer, then it is also sufficiently unclear to fool the humans reading the code and making decisions based on its deduced semantics! This is a very “meta” way to use a static analyzer, and I love it.
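
To make that concrete, here is a minimal sketch of the pattern; the scenario and every name in it are invented for illustration, not taken from JPL’s code. Suppose an analyzer cannot see a non-local invariant and reports a possible null dereference:

using System;

static class Example
{
    // Suppose callers guarantee that s is never null, but that guarantee
    // lives far away from this method, so a checker reports a possible
    // null dereference on "return s.Length;" -- a false positive.
    //
    // The JPL-style response: rather than marking the report as a false
    // positive in the triage tool, restate the invariant right here, where
    // both the analyzer and a human reader can verify it.
    static int Length(string s)
    {
        if (s == null)
            throw new ArgumentNullException("s");
        return s.Length;
    }
}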

21 thoughts on “Copy-paste defects”

  1. Really like the last part about the “even if it is correct, let’s make it clearer”. Indeed! Just because the code is correct now doesn’t mean it will stay correct as refactorings get introduced. By writing the code in a way that doesn’t force you to turn off the safety of the static analyser, you keep your safety harness when introducing changes. Being correct is not enough; it must also be clear to anyone reviewing it, and that includes the static analyser ^_^.

    • Around 2003 I said goodbye to a co-worker whose only fault was that he had inherited some very ugly code that worked, and, being a true-blue code warrior and perfectionist, rewrote it beautifully, whereupon it stopped working at the very worst moment (on the eve of the customer demo). Since then: ain’t broke, don’t fix. Period.

        • Unit tests do not always guard very well against situations where outside code makes unjustified assumptions about the behavior of a class. For example, the specification for a hash table may allow for repeated enumerations to yield items in arbitrarily-different sequence, but a particular implementation might be designed such that consecutive repeated enumerations always yield the same sequence, and a consumer might (incorrectly) rely upon such behavior. Unless the person writing the unit-test code for the consumer thought to use a hash table that would return different sequences when enumerated twice consecutively, the fact that code would fail when used with such a hash table would likely go undiscovered unless or until it is used with a hash table which unexpectedly but legitimately returns items in different sequences on repeated enumerations.
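
          As a minimal sketch of that kind of unjustified assumption (all names here are hypothetical), consider a caller that pairs up keys and values by enumerating a dictionary twice:

          using System.Collections.Generic;
          using System.Linq;

          static class EnumerationOrderExample
          {
              // Pairs keys with values by enumerating the table twice. This
              // happens to work on implementations whose enumeration order is
              // stable across consecutive calls, but the general contract does
              // not promise that; an implementation that legitimately returns
              // a different order the second time would silently mis-pair them.
              static List<KeyValuePair<string, int>> Pair(IDictionary<string, int> table)
              {
                  var keys = table.Keys.ToList();       // first enumeration
                  var values = table.Values.ToList();   // second enumeration
                  return keys.Zip(values,
                      (k, v) => new KeyValuePair<string, int>(k, v)).ToList();
              }
          }

          A unit test would only catch this if it deliberately used a table that shuffles its order between enumerations, which is exactly the scenario described above.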

        • Unit tests do not perform static analysis. Tests (either automated or manual) serve an entirely different function, just as code-reviews (either formal or informal) also serve an entirely different function.

  2. Great post. I find the false positives aspect quite interesting since it seems to be the biggest potential downside to using a static analyzer — not only do you spend time investigating and “fixing” the problem, but as Brandon points out it’s an opportunity to introduce a new defect.

    If you’re so inclined, I’d like to hear more about false positives, especially common issues and how they’re mitigated (both in the analyzer heuristics and in the code).

    Thanks!

  3. Conceptually, if a section of code would be worth copying and pasting, it should be worth migrating to its own routine. Unfortunately, in many cases such as your example, the parameter lists for the resulting method could easily be longer than the code itself.

    Personally, I’m a strong proponent of designing languages to avoid repetition. Sure, typing `p.Y` twice may not seem like much of a nuisance, but it leads to problems like the one here. In speaking, one wouldn’t describe the first line’s [correctly written] condition as “p.X is greater than Bounds.Right or P.X is less than Bounds.Left”; one would say “p.X is greater than Bounds.Right or less than Bounds.Left”. Adding support for variable declarations within an expression may ease such things. If the expression had been:

    if (var x=p.X; x > bounds.Right || y < bounds.Left)

    the defect would have been caught by the compiler [at least if there was no variable ‘y’ in scope].

    • C# will likely add support for that sort of expression in 6.0.

      The repetition could also be removed by using a helper method:

      !InOrder(bottom, y, top)

      Though personally I would have written the code shown as:

      !(bottom <= y && y <= top)

      Whenever I write that kind of disequality I try to make it look lexically like bottom < y < top, and by putting the two “y”s close to each other, the cut-n-paste error becomes a lot more obvious.
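
      For concreteness, here is what such a helper might look like; the name InOrder is from above, and the body is my guess at the obvious implementation:

      static bool InOrder(float low, float value, float high)
      {
          // True exactly when low <= value <= high, reading left to right.
          return low <= value && value <= high;
      }

      // e.g. the test above becomes: if (!InOrder(bottom, y, top)) ...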

      • In Python you can actually write:

        if not bounds.Left <= x <= bounds.Right:

        I wish more languages did little things like that.

        The visual similarity between x and y doesn’t help either. Names like m and k would make it stand out more that something was wrong.

      • Since C# separates bool from int, boolean less-than int is an invalid expression. I wonder if a later revision of C# could assign a meaning to this currently invalid operation and thereby introduce a new ternary operator, (x LT y LT z)?

        • I don’t know that one could do that without causing compatibility problems with code whose behavior is presently defined, if obscure. What I would like to see in a language design, as a more general principle, would be for most operators to return a structure which contained the operands and defined the operation, and which could be implicitly converted to the destination type. For example, dividing a `long` by a `float` would yield an `OperatorDivide<Int64, Single>`, which would have implicit conversions to both `Single` and `Double`, with the conversion to `Double` being “preferred”. Adding two `int` values would yield an `OperatorAdd<Int32, Int32>`, which would be implicitly convertible to either `Int32` or `Int64` (conversion of the result to `Int64` would yield a correct 64-bit sum).

          If operators worked in such a fashion, then an overload for a less-than operator could accept an `OperatorLessThan<Int32, Int32>` and an `Int32` and yield `Boolean` by checking whether the second operand of the first operator was between the other two values. For such an approach to work efficiently without requiring a separately-defined type for every operand combination of interest, it would be necessary for C# to support generic operators, which it presently does not. Still, such an approach might be interesting in future language designs, though if `bool` supported relational operators (`false` is less than `true`), the meaning of `bool1 <= bool2 <= bool3` might be ambiguous (should it allow FFF, FFT, FTT, and TTT, or should it allow FFF, FFT, FTT, TFF, and TFT?)
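
          As a toy illustration of chaining in today’s C# (this is my sketch, not the generic operator scheme described above, and all names are made up), a wrapper struct can carry the running result and the rightmost operand forward:

          struct Chained
          {
              public readonly bool Ok;     // result of the comparisons so far
              public readonly int Value;   // rightmost operand seen so far
              public Chained(bool ok, int value) { Ok = ok; Value = value; }
              public static Chained Begin(int value) { return new Chained(true, value); }
              // C# requires < and > to be overloaded in pairs, hence both.
              public static Chained operator <(Chained c, int next)
              {
                  return new Chained(c.Ok && c.Value < next, next);
              }
              public static Chained operator >(Chained c, int next)
              {
                  return new Chained(c.Ok && c.Value > next, next);
              }
              public static implicit operator bool(Chained c) { return c.Ok; }
          }

          // Usage: if (Chained.Begin(left) < x < right) { ... }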

  4. Pingback: The Morning Brew - Chris Alcock » The Morning Brew #1654

  5. Pingback: Dew Drop – July 18, 2014 (#1817) | Morning Dew

  6. I’ve had good experience with fixing static analysis issues even when they didn’t ‘need’ to be fixed. For example, an ineffective-write diagnostic once highlighted a genuine error: a mutex wasn’t locked when it should have been. I like static analysis…
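
    A hypothetical C# reconstruction of that kind of find (the names and the shape of the code are invented; only the “ineffective write revealed a missing lock” idea is from the comment above):

    using System.Threading;

    class Counter
    {
        private readonly object gate = new object();
        private int sharedCount;

        public void Increment()
        {
            // 'acquired' is written but never read -- the ineffective write.
            // Chasing the pedantic-looking diagnostic reveals the real bug:
            // nothing checks whether the lock was actually taken, so the
            // increment can run unprotected.
            bool acquired = Monitor.TryEnter(gate);
            try
            {
                sharedCount++;
            }
            finally
            {
                Monitor.Exit(gate); // throws if the lock was never acquired
            }
        }
    }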

  7. I’m honestly a bit surprised at how surprised people are by JPL’s tactic. Isn’t it a natural extension of “let’s tidy up our code and run with -Wall”?

  8. Since no one has responded to the question about NASA yet, I’ll put in my $0.02 worth. I’ve worked as a programmer on two NASA missions, and nowhere that I know of was copy/paste expressly forbidden. That said, I must also say that I never worked on flight software that was actually controlling a spacecraft, rocket, or lander/probe; I’ve only ever worked on ground data systems and analysis software. It’s entirely possible that in some areas, that is the rule.

    The closest I came to flight software was managing the team doing the communication protocols and interfaces to be used on board SOFIA. We had pretty strict requirements on the code, but there was no rule I know of that expressly forbade copy/paste.

  9. Pingback: A different copy-paste error | Fabulous adventures in coding

  10. Pingback: Know your compiler warnings | Depth Infinity
